cheekeet86 / project_2

AMES Housing Prices (General Assembly SG Data Science Immersive Batch 9)

Home Page: https://www.kaggle.com/c/dsi-us-6-project-2-regression-challenge

Jupyter Notebook 100.00%
general-assembly data-science housing

project_2's Introduction

Project 2 - Ames Housing Data and Kaggle Challenge

Welcome to Project 2! It's time to start modeling.

Primary Learning Objectives:

  1. Creating and iteratively refining a regression model
  2. Using Kaggle to practice the modeling process
  3. Providing business insights through reporting and presentation.

You are tasked with creating a regression model based on the Ames Housing Dataset. This model will predict the price of a house at sale.

The Ames Housing Dataset is an exceptionally detailed and robust dataset with over 70 columns of different features relating to houses.

Secondly, we are hosting a competition on Kaggle to give you the opportunity to practice the following skills:

  • Refining models over time
  • Use of train-test split, cross-validation, and data with unknown values for the target to simulate the modeling process
  • The use of Kaggle as a place to practice data science

As always, you will be submitting a technical report and a presentation. You may find that the best model for Kaggle is not the best model to address your data science problem.

Set-up

Before you begin working on this project, please do the following:

  1. Sign up for an account on Kaggle
  2. IMPORTANT: Click this link (Regression Challenge Sign Up) to join the competition (otherwise you will not be able to make submissions!)
  3. Review the material on the DSI-US-6 Regression Challenge
  4. Review the data description.

The Modeling Process

  1. The train dataset has all of the columns that you will need to generate and refine your models. The test dataset has all of those columns except for the target that you are trying to predict in your Regression model.
  2. Generate your regression model using the training data. We expect that within this process, you'll be making use of:
    • train-test split
    • cross-validation / grid searching for hyperparameters
    • strong exploratory data analysis to question correlation and relationship across predictive variables
    • code that reproducibly and consistently applies feature transformations (such as scikit-learn's preprocessing module)
  3. Predict the values for your target column in the test dataset and submit your predictions to Kaggle to see how your model does against unknown data.
    • Note: Kaggle expects to see your submissions in a specific format. Check the challenge's page to make sure you are formatting your CSVs correctly!
    • You are limited to models you've learned in class. In other words, you cannot use XGBoost, Neural Networks or any other advanced model for this project.
  4. Evaluate your models!
    • consider your evaluation metrics
    • consider your baseline score
    • how can your model be used for inference?
    • why do you believe your model will generalize to new data?
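
The steps above can be sketched in a few lines. This is a minimal illustration, not the required solution: the data is synthetic, the column names (living_area, overall_quality, neighborhood, sale_price) are hypothetical stand-ins for Ames features, and Ridge is just one of the models permitted for this project.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the training data; all column names are hypothetical.
rng = np.random.default_rng(42)
n = 300
train = pd.DataFrame({
    "living_area": rng.normal(1500, 400, n),
    "overall_quality": rng.integers(1, 11, n),
    "neighborhood": rng.choice(["A", "B", "C"], n),
})
train["sale_price"] = (
    100 * train["living_area"]
    + 15_000 * train["overall_quality"]
    + rng.normal(0, 10_000, n)
)

X = train.drop(columns="sale_price")
y = train["sale_price"]

# Hold out a validation set that is never used for fitting or tuning.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Keeping transformations inside the pipeline means they are fit on the
# training folds only and re-applied identically to validation and test data.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["living_area", "overall_quality"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["neighborhood"]),
])
model = Pipeline([("preprocess", preprocess), ("ridge", Ridge())])

# Cross-validated grid search over the regularization strength.
search = GridSearchCV(
    model, {"ridge__alpha": [0.1, 1.0, 10.0, 100.0]}, cv=5, scoring="r2"
)
search.fit(X_train, y_train)

# Score the tuned pipeline on the held-out validation data.
val_r2 = search.score(X_val, y_val)
print(search.best_params_, round(val_r2, 3))
```

Because the scaler and encoder live inside the pipeline, calling search.predict on the Kaggle test set would apply exactly the same transformations that were learned from the training data.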

Submission

Materials must be submitted by the beginning of class on 7th June 2019 (Friday).

Your technical report will be hosted on GitHub. Make sure it includes:

  • A README.md (that isn't this file)
  • Jupyter notebook(s) with your analysis and models (renamed to describe your project)
  • At least one successful prediction submission on DSI-US-6 Regression Challenge -- you should see your name in the "Leaderboard" tab.
  • Data files
  • Presentation slides
  • Any other necessary files (images, etc.)
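
For the prediction submission listed above, a CSV can be produced along these lines. This is only a sketch: the column names Id and SalePrice and the example values are assumptions, so check the challenge's page for the exact format before uploading.

```python
import pandas as pd

# Hypothetical identifiers and predictions; in practice these would come
# from the test set and from the fitted model's predict() call.
test_ids = [1, 2, 3]
predictions = [134_500.0, 162_000.0, 219_750.0]

# The column names "Id" and "SalePrice" are assumptions; confirm them on the
# competition page before submitting.
submission = pd.DataFrame({"Id": test_ids, "SalePrice": predictions})

# One row per test observation, with no index column.
# Passing a file path to to_csv would write the file instead of returning text.
csv_text = submission.to_csv(index=False)
print(csv_text)
```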

Presentation Structure

  • Must be within time limit established by local instructor.
  • Use Google Slides or some other visual aid (Keynote, Powerpoint, etc).
  • Consider the audience. Check with your local instructor for direction.
  • Start with the data science problem.
  • Use visuals that are appropriately scaled and interpretable.
  • Talk about your procedure/methodology (high level).
  • Talk about your primary findings.
  • Make sure you provide clear recommendations that follow logically from your analyses and narrative and answer your data science problem.

Be sure to rehearse and time your presentation before class.


Rubric

Your local instructor will evaluate your project (for the most part) using the following criteria. You should make sure that you consider and/or follow most if not all of the considerations/recommendations outlined below while working through your project.

Scores will be out of 27 points based on the 9 items in the rubric.
3 points per section

Score  Interpretation
0      Project fails to meet the outlined expectations; many major issues exist.
1      Project close to meeting expectations; many minor issues or a few major issues.
2      Project meets expectations; few (and relatively minor) mistakes.
3      Project demonstrates a thorough understanding of all of the considerations outlined.

The Data Science Process

Problem Statement

  • Is it clear what the student plans to do?
  • What type of model will be developed?
  • How will success be evaluated?
  • Is the scope of the project appropriate?
  • Is it clear who cares about this or why this is important to investigate?
  • Does the student consider the audience and the primary and secondary stakeholders?

Data Cleaning and EDA

  • Are missing values imputed appropriately?
  • Are distributions examined and described?
  • Are outliers identified and addressed?
  • Are appropriate summary statistics provided?
  • Are steps taken during data cleaning and EDA framed appropriately?
  • Does the student address whether or not they are likely to be able to answer their problem statement with the provided data given what they've discovered during EDA?
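
As a rough illustration of the imputation and outlier checks above, the sketch below uses a small synthetic frame. The column names are hypothetical, and both the median fill and the 1.5 × IQR rule are common defaults rather than the required approach; the right treatment depends on what each missing value means in the data description.

```python
import numpy as np
import pandas as pd

# Small synthetic frame; column names are hypothetical.
df = pd.DataFrame({
    "lot_frontage": [60.0, np.nan, 75.0, 80.0, np.nan, 65.0],
    "garage_type": ["Attchd", None, "Detchd", "Attchd", None, "Attchd"],
    "living_area": [1200, 1500, 1350, 1600, 5600, 1400],
})

# Count missing values per column before deciding how to treat them.
missing_counts = df.isna().sum()

# Numeric gap: fill with the median, which is less sensitive to skew than the mean.
df["lot_frontage"] = df["lot_frontage"].fillna(df["lot_frontage"].median())

# Categorical gap: here a missing value is assumed to mean "no garage".
df["garage_type"] = df["garage_type"].fillna("None")

# Flag outliers with the 1.5 * IQR rule, then inspect them before dropping anything.
q1, q3 = df["living_area"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["living_area"] < q1 - 1.5 * iqr) | (df["living_area"] > q3 + 1.5 * iqr)

print(missing_counts)
print(df[is_outlier])
```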

Preprocessing and Modeling

  • Are categorical variables one-hot encoded?
  • Does the student investigate or manufacture features with linear relationships to the target?
  • Have the data been scaled appropriately?
  • Does the student properly split and/or sample the data for validation/training purposes?
  • Does the student utilize feature selection to remove noisy or multicollinear features?
  • Does the student test and evaluate a variety of models to identify a production algorithm (AT MINIMUM: linear regression, lasso, and ridge)?
  • Does the student defend their choice of production model relevant to the data at hand and the problem?
  • Does the student explain how the model works and evaluate its performance successes/downfalls?
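
One way to compare the minimum set of models named above is to cross-validate each inside the same scaling pipeline. The sketch below uses synthetic data, and the alpha values are arbitrary placeholders that would normally be tuned.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: ten features, of which only the first three carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.5 * X[:, 2] + rng.normal(scale=0.5, size=200)

# Placeholder regularization strengths; these would normally be grid searched.
candidates = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.1),
}

# Scale inside the pipeline so each fold is standardized using its own
# training portion, then record the mean cross-validated R^2 per model.
cv_r2 = {}
for name, estimator in candidates.items():
    pipeline = make_pipeline(StandardScaler(), estimator)
    scores = cross_val_score(pipeline, X, y, cv=5, scoring="r2")
    cv_r2[name] = scores.mean()

for name, score in cv_r2.items():
    print(f"{name}: {score:.3f}")
```

Scaling matters most for lasso and ridge, since their penalties depend on the magnitude of the coefficients.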

Evaluation and Conceptual Understanding

  • Does the student accurately identify and explain the baseline score?
  • Does the student select and use metrics relevant to the problem objective?
  • Is more than one metric utilized in order to better assess performance?
  • Does the student interpret the results of their model for purposes of inference?
  • Is domain knowledge demonstrated when interpreting results?
  • Does the student provide appropriate interpretation with regard to descriptive and inferential statistics?
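
To make the baseline and multiple-metric questions above concrete, the sketch below compares a mean-only baseline with a fitted model on RMSE, MAE, and R². The data is synthetic and the choice of metrics is an assumption; use whichever metrics fit the problem statement.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the housing features.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
y = 50.0 + 10.0 * X[:, 0] + 5.0 * X[:, 1] + rng.normal(scale=2.0, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)


def report(estimator):
    """Fit on the training split and return several metrics on the test split."""
    estimator.fit(X_train, y_train)
    pred = estimator.predict(X_test)
    return {
        "rmse": np.sqrt(mean_squared_error(y_test, pred)),
        "mae": mean_absolute_error(y_test, pred),
        "r2": r2_score(y_test, pred),
    }


# The baseline always predicts the mean of the training target.
baseline = report(DummyRegressor(strategy="mean"))
fitted = report(LinearRegression())

print("baseline:", baseline)
print("fitted:  ", fitted)
```

A model is only useful to the extent that it beats the baseline; RMSE and MAE report the error in the units of the target, while R² reports the share of variance explained.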

Conclusion and Recommendations

  • Does the student provide appropriate context to connect individual steps back to the overall project?
  • Is it clear how the final recommendations were reached?
  • Are the conclusions/recommendations clearly stated?
  • Does the conclusion answer the original problem statement?
  • Does the student address how findings of this research can be applied for the benefit of stakeholders?
  • Are future steps to move the project forward identified?

Organization and Professionalism

Project Organization

  • Are modules imported correctly (using appropriate aliases)?
  • Are data imported/saved using relative paths?
  • Does the README provide a good executive summary of the project?
  • Is markdown formatting used appropriately to structure notebooks?
  • Is there an appropriate number of comments to support the code?
  • Are files & directories organized correctly?
  • Are there unnecessary files included?
  • Do files and directories have well-structured, appropriate, consistent names?
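
The import and relative-path questions above might look like this in practice. The directory layout (a datasets folder next to the notebook folder) and the file name train.csv are hypothetical.

```python
from pathlib import Path

import pandas as pd  # conventional alias

# Hypothetical layout: notebooks live in one folder and data in a sibling
# "datasets" folder, so the path is written relative to the notebook.
DATA_DIR = Path("..") / "datasets"
train_path = DATA_DIR / "train.csv"


def load_train(data_dir=DATA_DIR):
    """Read the training data from a path relative to the notebook."""
    return pd.read_csv(data_dir / "train.csv")


print(train_path.as_posix())
```

A relative path keeps the notebook runnable on any machine that clones the repository, whereas an absolute path ties it to one user's file system.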

Visualizations

  • Are sufficient visualizations provided?
  • Do plots accurately demonstrate valid relationships?
  • Are plots labeled properly?
  • Are plots interpreted appropriately?
  • Are plots formatted and scaled appropriately for inclusion in a notebook-based technical report?

Python Syntax and Control Flow

  • Is care taken to write human readable code?
  • Is the code syntactically correct (no runtime errors)?
  • Does the code generate desired results (logically correct)?
  • Does the code follow general best practices and style guidelines?
  • Are Pandas functions used appropriately?
  • Are sklearn methods used appropriately?

Presentation

  • Is the problem statement clearly presented?
  • Does a strong narrative run through the presentation building toward a final conclusion?
  • Are the conclusions/recommendations clearly stated?
  • Is the level of technicality appropriate for the intended audience?
  • Is the student substantially over or under time?
  • Does the student appropriately pace their presentation?
  • Does the student deliver their message with clarity and volume?
  • Are appropriate visualizations generated for the intended audience?
  • Are visualizations necessary and useful for supporting conclusions/explaining findings?

REMEMBER:

This is a learning environment and you are encouraged to try new things, even if they end up failing. While this rubric outlines what we look for in a good project, it is up to you to go above and beyond to create a great project. Learn from your failures and you'll be prepared to succeed in the workforce.
