This repo contains my implementation of the Disaster Response Pipeline project, which is part of the Udacity Data Scientist Program (Data Engineering).
I followed the course instructions to build ETL, NLP and machine learning pipelines, and used the code and skills learned to complete this project.
Please note that to reproduce the results, you need to install the exact versions of the libraries using pip:
pip install -r requirements.txt
I recommend installing into a virtual environment, for example one managed by Anaconda.
The folder structure of this repo is as follows:
|-app
|  |-templates # html templates
|  |-run.py # script to run web demo
|-data
|  |-disaster_categories.csv # original categories dataset
|  |-disaster_messages.csv # original messages dataset
|  |-process_data.py # ETL pipeline script
|  |-ETL Pipeline Preparation.ipynb # the ETL pipeline preparation notebook
|-models
|  |-train_classifier.py # NLP & ML pipeline script
|  |-ML Pipeline Preparation.ipynb # the NLP & ML pipeline preparation notebook
- Run the following commands in the project's root directory to set up your database and model.
- To run the ETL pipeline that cleans the data and stores it in the database:
python data/process_data.py data/disaster_messages.csv data/disaster_categories.csv data/DisasterResponse.db
- To run the ML pipeline that trains the classifier and saves it:
python models/train_classifier.py data/DisasterResponse.db models/classifier.pkl
- Go to the app directory:
cd app
- Run your web app:
python run.py
- Click the PREVIEW button to open the homepage.
- Enter a message and click the green "Classify Message" button to see the results.
This implementation was made as homework to record my learning path, and it has several limitations:
- The dataset is heavily imbalanced. Some categories have very few positive samples, and some have none at all. For categories with no positive samples, the classification results are meaningless. For categories with few positive samples, other techniques such as over-sampling need to be introduced during training.
- The model parameters could be tuned further. Since Udacity offers limited online computing resources, the grid search had to be restricted to a small parameter space. Better parameters could likely be found with more computing power.
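To make the first limitation more concrete, here is a minimal sketch of random over-sampling for a single category. It is not part of this repo's training script. The function names (`count_positives`, `oversample`), the toy messages and the category names are all made up for illustration, and it uses only the Python standard library.

```python
import random
from collections import Counter


def count_positives(labels):
    """Return the number of positive (1) labels per category name."""
    counts = Counter()
    for row in labels:
        for category, value in row.items():
            counts[category] += value
    return counts


def oversample(messages, labels, category, seed=0):
    """Randomly duplicate positive rows of one category until it is balanced.

    Rows are drawn with replacement. If the category has no positive rows,
    the data is returned unchanged because there is nothing to duplicate.
    """
    rng = random.Random(seed)
    positives = [i for i, row in enumerate(labels) if row[category] == 1]
    negatives = [i for i, row in enumerate(labels) if row[category] == 0]
    if not positives or len(positives) >= len(negatives):
        return list(messages), list(labels)
    n_extra = len(negatives) - len(positives)
    extra = [rng.choice(positives) for _ in range(n_extra)]
    indices = list(range(len(messages))) + extra
    return [messages[i] for i in indices], [labels[i] for i in indices]


# Hypothetical toy data: one label dict per message (not the real dataset).
messages = ["we need water", "road is blocked", "thank you", "hello", "good morning"]
labels = [
    {"water": 1, "shelter": 0},
    {"water": 0, "shelter": 0},
    {"water": 0, "shelter": 0},
    {"water": 0, "shelter": 0},
    {"water": 0, "shelter": 0},
]

positives_per_category = count_positives(labels)
balanced_messages, balanced_labels = oversample(messages, labels, "water")
```

Two caveats apply to this sketch. Because each message carries several labels, duplicating rows for one category also shifts the counts of every other category. Over-sampling should also be applied only to the training split, so that duplicated rows do not leak into the test set.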