Giter Site home page Giter Site logo

gtu-ta-ml-proj's Introduction

Data Science Certificate Teaching Assistant

Machine learning project for the TA position at Georgetown.

Author:

Greg Barbieri - @gfbarbieri

Description of Task

The goal of this project is to demonstrate knowledge and use of the data science pipeline, as well as machine learning models and the Yellowbrick library. I accomplish these goals by predicting the income categories of households participating in the Consumer Expenditure (CE) Interview survey, a survey conducted by the Census Bureau on behalf of the Bureau of Labor Statistics (BLS).

Three notebooks were drafted, each covering distinct portions of the data science pipeline: (1) ingestion and wrangling, (2) feature selection, and (3) model selection and evaluation including hyper-parameter tuning. The request was for a single notebook that covers the entire pipeline and demonstrates competency with ML models and the Yellowbrick library, so I created an abridged version of the three parts.

Part 1:

Part 1 demonstrates ingesting data over the web, using preprocessing techniques to encode a target variable, aggregating data to create features, using visual techniques to assess the quality of the data, and storing the data using WORM methodology.

Part 2:

Part 2 demonstrates assessing the relationship between features, specifically utilizing the Yellowbrick library to assess colinearity and class separation. This part demonstrates multiple approaches to scaling features using the sklearn library. Lastly, this notebook covers feature selection using feature importance and feature elimination techniques.

Part 3:

Part 3 demonstrates preprocessing, modeling using multiple classifiers, model evaluation using Yellowbrick's classification model visualizers, hyper-parameter tuning, and model output.

Contents

data: Data resulting from ingestion and wrangling process. Data is ingested from the CE PUMD website and stored in a SQLite database table.
notebooks: The final notebook covering all competencies is Predicting_Household_Income.ipynb. Groups of topics are covered in three parts: ingestion and wrangling, feature selection, and model selection and evaluation.
model: Encoding, scaling, and model parameters.

Data Sources

  1. CE PUMD Website

Note on Prediction: Prediction typically means using past data to estimate future data, where past, present, and future take their usual place in time. Since the data I am using to predict household income (expenditures) was created chronologically after the household income was earned or expected to be earned, a better phrase is that I am explaining, inferring, or imputing household income. In fact, using data produced choronologically after the data one is trying to predict may become a source of data leakage. That said, in the real world, data leakage is more concerned about when the data is available to the model, not when it is theoretically generated by the user. That is, "future data" is simply the data you wish you had and "current" or "past" data are the data you have available now to make a prediction.

gtu-ta-ml-proj's People

Contributors

gfbarbieri avatar

Watchers

James Cloos avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.