Module 3 Final Project

Congratulations! You've made it through another intense module, and now you're ready to show off your newfound Machine Learning skills!

All that remains for Module 3 is to complete the final project!

The Project

The main goal of this project is to create a classification model. For this project you have the choice to either:

  • choose a data set from a curated list, or
  • choose your own data set outside of the curated list.

The data guidelines for either option are shown below.

For this project, you're going to select a dataset of your choosing and create a classification model. You'll start by identifying a problem you can solve with classification, and then identify a dataset. You'll then use everything you've learned about Data Science and Machine Learning thus far to source a dataset, preprocess and explore it, and then build and interpret a classification model that answers your chosen question.

a. Choosing the data from a curated list

You are allowed to select one of the three data sets described below. Each comes with its own advantages and disadvantages, and, of course, its own associated business problem and stakeholders. It may be desirable to flesh out your understanding of the audience or the business proposition a little more than sketched out here. If you select one of these three data sets, you need no further approval from your instructor.

  1. Chicago Car Crash Data. Note this links also to Vehicle Data and to Driver/Passenger Data.

    Build a classifier to predict the primary contributory cause of a car accident, given information about the car, the people in the car, the road conditions, etc. You might imagine your audience as a Vehicle Safety Board that is interested in reducing traffic accidents, or as the City of Chicago, which is interested in becoming aware of any interesting patterns. Note that this is a multi-class classification problem. You will almost certainly want to bin or trim or otherwise limit the number of target categories on which you ultimately predict. Note e.g. that some primary contributory causes have very few samples (see the binning sketch after this list).

  2. Customer Churn Data

    Build a classifier to predict whether a customer will ("soon") stop doing business with SyriaTel, a telecommunications company. Note that this is a binary classification problem.

    Most naturally, your audience here would be the telecom business itself, interested in reducing how much money it loses to customers who don't stick around very long. Are there any predictable patterns here?

  3. Tanzanian Water Well Data (active competition!)

    Tanzania, as a developing country, struggles with providing clean water to its population of over 57,000,000. There are many waterpoints already established in the country, but some are in need of repair while others have failed altogether.

    Build a classifier to predict the condition of a water well, using information about the sort of pump, when it was installed, etc. Note that this is a ternary classification problem.
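
For example, here is a minimal target-binning sketch in pandas for the crash data option, assuming the data has already been downloaded as a CSV. The file path and the PRIM_CONTRIBUTORY_CAUSE column name are assumptions based on Chicago's data dictionary and may need adjusting to your extract:

```python
import pandas as pd

# Load the crash data; path and column name are assumptions, not prescribed.
df = pd.read_csv("data/raw/traffic_crashes.csv")

# Inspect how many crashes fall under each primary contributory cause.
cause_counts = df["PRIM_CONTRIBUTORY_CAUSE"].value_counts()
print(cause_counts.tail(10))  # the rarest causes

# Collapse causes with fewer than 500 crashes (an arbitrary cutoff) into a
# single "OTHER" category so every class has enough samples to learn from.
rare_causes = cause_counts[cause_counts < 500].index
df["target"] = df["PRIM_CONTRIBUTORY_CAUSE"].where(
    ~df["PRIM_CONTRIBUTORY_CAUSE"].isin(rare_causes), other="OTHER"
)
print(df["target"].value_counts())
```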

b. Selecting a Data Set Outside of the Curated List

We encourage you to be very thoughtful when identifying your problem and selecting your data set--an overscoped project goal or a poor data set can quickly bring an otherwise promising project to a grinding halt. If you are going to choose your own data set, you'll need to run it by your instructor for approval.

To help you select an appropriate data set for this project, we've set some guidelines:

  1. Your dataset should work for classification. The classification task can be either binary or multiclass, as long as it's a classification model.

  2. Your dataset needs to be of sufficient complexity. Try to avoid picking an overly simple dataset. Try to avoid extremely small datasets, as well as the most common datasets like Titanic, Iris, MNIST, etc. We want to see all the steps of the Data Science Process in this project--it's okay if the dataset is mostly clean, but we expect to see some preprocessing and exploration. See the following section, Data Set Constraints, for more information on this.

  3. On the other end of the spectrum, don't pick a problem that's too complex, either. Stick to problems that you have a clear idea of how to solve with machine learning. For now, we recommend you stay away from overly complex problems in the domains of Natural Language Processing or Computer Vision--although those domains make use of Supervised Learning, they come with a lot of other special requirements and techniques that you don't know yet (but you'll learn soon!). If your chosen problem feels overscoped, it probably is. If you aren't sure whether your problem scope is appropriate, double-check with your instructor!

Data Set Constraints

When selecting a data set, be sure to take into consideration the following constraints:

  1. Your data set can't be one we've already worked with in any labs.
  2. Your data set should contain a minimum of 1000 rows.
  3. Your data set should contain a minimum of 10 predictor columns, before any one-hot encoding is performed.
  4. Your instructor must provide final approval on your data set.

Problem First, or Data First?

There are two ways that you can go about getting started: Problem-First or Data-First.

Problem-First: Start with a problem that you want to solve with classification, and then try to find the data you need to solve it. If you can't find any data to solve your problem, then you should pick another problem.

Data-First: Take a look at some of the most popular internet repositories of cool data sets we've listed below. If you find a data set that's particularly interesting for you, then it's totally okay to build your problem around that data set.

There are plenty of amazing places that you can get your data from. We recommend starting your search with well-known public data set repositories.

Deliverables

  1. A public GitHub repository.
  2. An environment.yml file that contains all the necessary packages needed to recreate your conda environment.
  3. A standalone src/ directory that stores all relevant source code.
    • All functions have docstrings that act as professional-quality documentation.
    • If applicable, well documented SQL queries with appropriate single-line or multiline comments.
    • Quality classification model
      • Whenever necessary, briefly explain in comments the changes made from one iteration to the next, and why you made these choices
  4. A standalone data/ directory that stores all relevant raw and processed data files
    • Be sure to include how the data was obtained!
    • All large files are labeled in the .gitignore file to avoid having them accidentally live in your commit history.
  5. A standalone references/ directory that stores all relevant literature, data dictionaries, or useful references that were used to help you during the project.
    • Use this directory to store physical copies of the .pdf files; or
    • Create a README.md file that cites external resources that were used.
  6. A standalone reports/ directory that stores your memo.md and presentation.pdf files
  7. A standalone notebooks/ directory that stores both your exploratory and report notebooks
    • A record of your workflow should be stored in notebooks/exploratory. Don't be afraid to leave in error messages, so you know what didn't work!
  8. A user-focused README.md file that briefly covers your process, methodology and findings.
    • Someone with no context on your project should be able to use this document to understand the structure of your project, and adapt your code for their needs.
  9. One final Jupyter Notebook file stored in notebooks/report that focuses on visualization and presentation
    • The very beginning of the notebook contains a description of the purpose of the notebook.
      • This is helpful for your future self and any of your colleagues who need to view your notebook. Without this context, you’re implicitly asking your peers to invest a lot of energy to help solve your problem. Help them to jump into your project by providing them the purpose of this Jupyter Notebook.
    • Explanation of the data sources and where one can retrieve them
      • Whenever possible, link to the corresponding data dictionary
    • Custom functions and classes are imported from Python modules and are not created directly in the notebook. As soon as you have a working function in one of your exploratory notebooks, copy it over to src so it is reusable (see the import sketch after this list).
    • At least 4 meaningful data visualizations, with corresponding interpretations. All visualizations are well labeled with axes labels, a title, and a legend (when appropriate)
    • Take the time to make sure that you craft your story well, and clearly explain your process and findings in a way that clearly shows both your technical expertise and your ability to communicate your results!
  10. A one-page memo stored in reports/memo.md written exclusively for a non-technical stakeholder.
    • This memo should describe:
      • A summary of the business problem you are trying to solve
      • Key takeaways from your solution
      • A section on next steps if you had more time (i.e. one additional week)
  11. An "Executive Summary" Keynote/PowerPoint/Google Slide presentation (delivered as a PDF export) that explains what you have found.
    • Make sure to also add and commit the PDF export of your non-technical presentation to your repository as reports/presentation.pdf.
    • The presentation should contain between 5 and 10 professional-quality slides detailing:
      • A high-level overview of your methodology
      • The results you’ve uncovered
      • Any real-world recommendations you would like to make based on your findings (ask yourself--why should the executive team care about what you found? How can your findings help the company/stakeholder?)
      • Avoid technical jargon and explain results in a clear, actionable way for non-technical audiences.
    • The slides should use visualizations whenever possible, and avoid walls of text
    • All visualizations included in this presentation should also be exported as image files (e.g. with plt.savefig, not by taking a screenshot) and saved under reports/figures/ (see the export sketch after this list)
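
For instance, here is a minimal sketch of the module-import pattern described in the notebooks deliverable above. The src/data_cleaning.py module and the clean_data function are hypothetical names used only for illustration:

```python
# Inside a notebook under notebooks/report/ (shown here as plain Python).
import sys

# Make the repository root importable so that `src` can be found; adjust the
# relative path to match where the notebook actually lives.
sys.path.append("../..")

import pandas as pd
from src.data_cleaning import clean_data  # hypothetical module and function

raw = pd.read_csv("../../data/raw/dataset.csv")  # placeholder file name
clean = clean_data(raw)                          # reuse the shared helper
```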
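And a minimal sketch of exporting a figure with plt.savefig so that it lands in reports/figures/ (the counts below are made-up placeholders):

```python
import matplotlib.pyplot as plt

# Placeholder counts; in practice these would come from your processed data.
class_counts = {"class A": 120, "class B": 45, "class C": 15}

fig, ax = plt.subplots(figsize=(8, 5))
ax.bar(list(class_counts.keys()), list(class_counts.values()))
ax.set_title("Target Class Distribution")
ax.set_xlabel("Class")
ax.set_ylabel("Count")

# Export an image file for the slide deck instead of taking a screenshot.
fig.savefig("reports/figures/class_distribution.png", dpi=150, bbox_inches="tight")
```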

The Process

These steps are informed by Smart Vision's description of the CRISP-DM process [1].

1. Business Understanding

Start by reading this document, and making sure that you understand the kinds of questions being asked. In order to narrow your focus, you will likely want to make some design choices about your specific audience, rather than trying to address every possible stakeholder at once. What does your chosen audience care most about? This framing will help you decide which stakeholder claims to address.

Three things to be sure you establish during this phase are:

  1. Objectives: what questions are you trying to answer, and for whom?
  2. Project plan: you may want to establish more formal project management practices, such as daily stand-ups or using a Trello board, to plan the time you have remaining. Regardless, you should determine the division of labor, communication expectations, and timeline.
  3. Success criteria: what does a successful project look like? How will you know when you have achieved it?

2. Data Understanding

Write a script to download the data (or instructions for future users on how to manually download it), and explore it. Do you understand what the columns mean? If your data comes in multiple tables, how do they relate to each other? How will you select the subset of relevant data? What kind of data cleaning is required?
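
For example, here is a minimal download-script sketch; the URL, file names, and the src/download_data.py location are placeholders to adapt to whichever dataset you choose:

```python
"""src/download_data.py -- hypothetical helper for fetching the raw data once."""
from pathlib import Path

import requests

DATA_URL = "https://example.com/path/to/dataset.csv"  # placeholder URL
RAW_DIR = Path("data/raw")


def download_raw_data(url: str = DATA_URL, dest_dir: Path = RAW_DIR) -> Path:
    """Download the raw CSV to data/raw/ if it is not already there."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest_path = dest_dir / "dataset.csv"
    if not dest_path.exists():  # skip the download on repeat runs
        response = requests.get(url, timeout=60)
        response.raise_for_status()
        dest_path.write_bytes(response.content)
    return dest_path


if __name__ == "__main__":
    print(f"Raw data saved to {download_raw_data()}")
```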

It may be useful to generate visualizations of the data during this phase.

3. Data Preparation

Through SQL and Pandas, perform any necessary data cleaning and develop a query that pulls in all relevant data for analysis in a classification model, including any merging of tables. Be sure to document any data that you choose to drop or otherwise exclude. This is also the phase to consider any feature scaling or one-hot encoding required to feed the data into a classification model.
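
As one possible sketch of that scaling and encoding step, using scikit-learn's ColumnTransformer (the column names are placeholders for your own dataset's features):

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Placeholder column names; substitute the predictors from your own dataset.
numeric_features = ["age_of_pump", "population_served"]
categorical_features = ["pump_type", "region"]

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numeric_features),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
    ]
)

# Fit the transformer on the training split only, then reuse it on the test
# split so no information leaks from the test data into preprocessing.
# X_train_processed = preprocessor.fit_transform(X_train)
# X_test_processed = preprocessor.transform(X_test)
```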

4. Modeling

The focus this time is on prediction. Good prediction is a matter of the model generalizing well. Steps we can take to ensure good generalization include: testing the model on unseen data, cross-validation, and regularization. What sort of model should you build? A diverse portfolio is probably best. Classification models we've looked at so far include logistic regression, decision trees, bagging, and boosting, each of these with different flavors. You are encouraged to try any or all of these.
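
For instance, here is a minimal sketch that cross-validates several of the model families named above on a synthetic placeholder dataset (swap in your own preprocessed features and target):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic placeholder data; replace with your own preprocessed X and y.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(max_depth=5),
    "random forest (bagging)": RandomForestClassifier(n_estimators=200),
    "gradient boosting": GradientBoostingClassifier(),
}

# 5-fold cross-validation gives an estimate of how each model generalizes.
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")
```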

5. Evaluation

Recall that there are many different metrics we might use for evaluating a classification model. Accuracy is intuitive, but can be misleading, especially if you have class imbalances in your target. Perhaps, depending on how you're defining things, it is more important to minimize false positives, or false negatives. It might therefore be more appropriate to focus on precision or recall. You might also calculate the AUC-ROC to measure how well your model discriminates between classes.
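
A minimal sketch of computing these metrics with scikit-learn, using a placeholder logistic regression fit on synthetic, imbalanced data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced placeholder data; use your own train/test split in practice.
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.85, 0.15],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]  # probability of the positive class

print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))  # how many flagged positives are real
print("recall   :", recall_score(y_test, y_pred))     # how many real positives are caught
print("AUC-ROC  :", roc_auc_score(y_test, y_proba))
```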

6. Deployment

In this case, your "deployment" comes in the form of the deliverables listed above. Make sure you can answer the following questions about your process:

  • "How did you pick the question(s) that you did?"
  • "Why are these questions important from a business perspective?"
  • "How did you decide on the data cleaning options you performed?"
  • "Why did you choose a given method or library?"
  • "Why did you select those visualizations and what did you learn from each of them?"
  • "Why did you pick those features as predictors?"
  • "How would you interpret the results?"
  • "How confident are you in the predictive quality of the results?"
  • "What are some of the things that could cause the results to be wrong?"

Grading Rubric

We are not using the online rubric, but it is here for your reference.

Citation

  1. "What is the CRISP-DM Methodology?" Smart Vision Europe. Available at: https://www.sv-europe.com/crisp-dm-methodology/
