Giter Site home page Giter Site logo

justinm0rgan / kings_county_housing_project Goto Github PK

View Code? Open in Web Editor NEW
0.0 1.0 0.0 10.1 MB

Linear Regression model to predict housing prices in Kings County Washington. (Flatiron Project 2)

Jupyter Notebook 79.39% HTML 20.61%
housing-prices kings-county washington descriptive-statistics eda

kings_county_housing_project's Introduction

Kings County Housing Price Project

Author: Justin Williams

Overview

The goal of this project was to create a linear regression model that would accurately predict housing prices in Kings County, Washington. A years worth of data spanning from 5/2/2014 - 5/27/2015 was provided. An extensive exploratory data analysis (EDA) was conducted, in which descriptive statistics and and visualizations allowed a better understanding linearity assumptions and data distributions. Statistical tests for significance were conducted, and a number of features were engineered. Through application linear regression, and various feature selection techniques, a model was retained that accurately predicts housing prices on test data for Kings County Washington with a ~$210k margin of error.

Business Problem

This project's goal was to develop a model that accurately predicts housing prices in Kings County, Washington. This model can subsequently be used by many constituents within the real estate industry. Agents and brokers will be able to utilize their domain specific knowledge as well as this model, to predict trends in housing prices. Additionally, home-buyers and sellers, could use this to to help assist with the decision making process on when to buy or sell a home.

That being said, the main question was:

  • What attributes of the data contributed to higher housing prices, and which of those had a greater effect?

Once the aforementioned was answered, features could be included and engineered, that would best contribute to the models ability to predict housing prices.

Data

The data being used for this project provided various attributes for homes that had sold from 5/2/2014 - 5/27/2015. The price target variable was included, which allowed comparison of relationships between the various attributes and the target.

The data schema is as follows:

Column Names and descriptions for Kings County Data Set

# name description
1 id unique ID for a house
2 date Date day house was sold
3 price Price is prediction target
4 bedrooms Number of bedrooms
5 bathrooms Number of bathrooms
6 sqft_living square footage of the home
7 sqft_lot square footage of the lot
8 floors Total floors (levels) in house
9 waterfront Whether house has a view to a waterfront
10 view Number of times house has been viewed
11 condition How good the condition is (overall)
12 grade overall grade given to the housing unit, based on King County grading system
13 sqft_above square footage of house (apart from basement)
14 sqft_basement square footage of the basement
15 yr_built Year when house was built
16 yr_renovated Year when house was renovated
17 zipcode zip code in which house is located
18 lat Latitude coordinate
19 long Longitude coordinate
20 sqft_living15 The square footage of interior housing living space for the nearest 15 neighbors
21 sqft_lot15 The square footage of the land lots of the nearest 15 neighbors

Primary consideration was given to obvious indicators such as, total square feet, amount of bedrooms and location. However, through EDA and other methodologies, additional factors proved to be salient price indicators.

Methods

The EDA began with basic descriptive statistics such as; mean, median, inter-quartile range, min and max. These were subsequently visualized through scatter plots to understand linear relationships with the target variable price. Histograms were created to view data distributions and aid in outlier identification. Outliers were handled on a case by case basis through descriptive statistics interpretation where appropriate. Once basic data cleaning was done, some statistical tests were administered to get a better understanding of the significance of assumptions made during the EDA process.

Statistical tests for significance prompted various features to be engineered, for example:

If a house was built before the median year houses were built in this dataset, and has NOT received a renovation, was the mean/median price of said house significantly different then those that received renovations?

Engineered features such as the aforementioned, allowed patterns to be discovered within the data that were not initially discoverable. After feature engineering, the data was split into separate training and test sets. Once fit each set, the model was run on the training and test set respectively. The lack of significant difference between output of train and test split allowed for the assumption the model that was not over fit. Some feature selection techniques were administered to see if altering feature selection would have a positive impact on the model. The best model included all features from analysis and was subsequently fit to the entire data set and saved for future use.

Results

Some results came out as expected. For instance, total square feet of living space, number of bedrooms and location had a high influence on price. However, other aspects that initially did not seem as important proved strong price indicators. For instance, view proved to be an important factor, not solely how high the rating was within the scoring system, but more significant was, did the house have a view at all?. Grade , which was the overall grade given to the housing unit, based on King County grading system, was a strong indicator of price. This was an aggregate of multiple aspects of each home, however if grade scored well it was a more accurate price indicator. Obvious relationships were confirmed and combined with less-obvious to create this model that predicted within ~$210k of housing prices in Kings County, Washington.

Relatiionships to target variable of price

graph

  • Clusters of points in an upward rightward diagonal indicates how strong the relationship of that particular variable is to the target price
  • Points aligned in columns indicates a grading system like view and grade.

Geographic outliers

graph1

  • Heat map shows geographic dispersion of houses within the dataset.
  • Points on map to the east indicate homes where the mean price was 50% less then the main aggregate to the west

Conclusions

A combination of many of the data points and engineered features where used to develop this model. It predicts housing prices with a ~$210k margin of error. For next steps I would like to complete a more thorough geospatial analysis sub-setting for various indicators utilizing proximation to amenities such as public transit, schools, and other places of interest. Perhaps looking at census data for demographics information as well could help minimize error and provide a more accurate predictive model.


For More Information

Please review full analysis in the following Jupyter Notebook.

For any additional questions, please contact: Justin Williams, [email protected]

Repository Structure

Structure of repository and its contents:

├── README.md                                 <- The top-level README for reviewers of this project
├── notebooks                                 <- Narrative documentation of analysis in Jupyter notebook
├── data                                      <- Both sourced externally and generated from code
└── images                                    <- Generated from code

kings_county_housing_project's People

Contributors

justinm0rgan avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.