adchang91 / dsc-1-11-11-multiple-linear-regression-in-statsmodels-lab-online-ds-pt-031119

This project forked from learn-co-students/dsc-1-11-11-multiple-linear-regression-in-statsmodels-lab-online-ds-pt-031119

License: Other

Jupyter Notebook 100.00%

Multiple Linear Regression in Statsmodels - Lab

Introduction

In this lab, you'll practice fitting a multiple linear regression model on our Boston Housing Data set!

Objectives

You will be able to:

  • Run a linear regression on the Boston Housing dataset using all the predictors
  • Interpret the parameters of the multiple linear regression model

The Boston Housing Data

We pre-processed the Boston Housing Data again. This time, however, we did things slightly differently:

  • We dropped "ZN" and "NOX" completely
  • We categorized "RAD" in 2 bins and "TAX" in 3 bins
  • We used min-max scaling on "B", "CRIM" and "DIS" (and log-transformed all of them first, except "B")
  • We used standardization on "AGE", "INDUS", "LSTAT" and "PTRATIO" (and log-transformed all of them first, except for "AGE")

import pandas as pd
import numpy as np
# Note: load_boston was removed in scikit-learn 1.2, so this import requires an older version
from sklearn.datasets import load_boston
boston = load_boston()

boston_features = pd.DataFrame(boston.data, columns = boston.feature_names)
boston_features = boston_features.drop(["NOX","ZN"],axis=1)

# Create bins based on the values observed: 3 cut points result in 2 bins
bins = [0, 6, 24]
bins_rad = pd.cut(boston_features['RAD'], bins)
bins_rad = bins_rad.cat.as_unordered()

# Create bins based on the values observed: 4 cut points result in 3 bins
bins = [0, 270, 360, 712]
bins_tax = pd.cut(boston_features['TAX'], bins)
bins_tax = bins_tax.cat.as_unordered()

tax_dummy = pd.get_dummies(bins_tax, prefix="TAX")
rad_dummy = pd.get_dummies(bins_rad, prefix="RAD")
boston_features = boston_features.drop(["RAD","TAX"], axis=1)
boston_features = pd.concat([boston_features, rad_dummy, tax_dummy], axis=1)
age = boston_features["AGE"]
b = boston_features["B"]
logcrim = np.log(boston_features["CRIM"])
logdis = np.log(boston_features["DIS"])
logindus = np.log(boston_features["INDUS"])
loglstat = np.log(boston_features["LSTAT"])
logptratio = np.log(boston_features["PTRATIO"])

# minmax scaling
boston_features["B"] = (b-min(b))/(max(b)-min(b))
boston_features["CRIM"] = (logcrim-min(logcrim))/(max(logcrim)-min(logcrim))
boston_features["DIS"] = (logdis-min(logdis))/(max(logdis)-min(logdis))

# standardization
boston_features["AGE"] = (age-np.mean(age))/np.sqrt(np.var(age))
boston_features["INDUS"] = (logindus-np.mean(logindus))/np.sqrt(np.var(logindus))
boston_features["LSTAT"] = (loglstat-np.mean(loglstat))/np.sqrt(np.var(loglstat))
boston_features["PTRATIO"] = (logptratio-np.mean(logptratio))/(np.sqrt(np.var(logptratio)))
boston_features.head()

       CRIM      INDUS  CHAS     RM        AGE       DIS    PTRATIO         B      LSTAT  RAD_(0, 6]  RAD_(6, 24]  TAX_(0, 270]  TAX_(270, 360]  TAX_(360, 712]
0  0.000000  -1.704344   0.0  6.575  -0.120013  0.542096  -1.443977  1.000000  -1.275260           1            0             0               1               0
1  0.153211  -0.263239   0.0  6.421   0.367166  0.623954  -0.230278  1.000000  -0.263711           1            0             1               0               0
2  0.153134  -0.263239   0.0  7.185  -0.265812  0.623954  -0.230278  0.989737  -1.627858           1            0             1               0               0
3  0.171005  -1.778965   0.0  6.998  -0.809889  0.707895   0.165279  0.994276  -2.153192           1            0             1               0               0
4  0.250315  -1.778965   0.0  7.147  -0.511180  0.707895   0.165279  1.000000  -1.162114           1            0             1               0               0

Run a linear model in Statsmodels

Run the same model in Scikit-learn
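
The Scikit-learn version could look like the sketch below. It again uses synthetic data with assumed column names and coefficients, so the numbers it prints say nothing about the housing data itself.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in for the preprocessed features and target.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "RM": rng.normal(6.3, 0.7, size=200),
    "LSTAT": rng.normal(0.0, 1.0, size=200),
})
y = 5.0 + 4.0 * X["RM"] - 3.0 * X["LSTAT"] + rng.normal(0.0, 0.5, size=200)

# LinearRegression fits an intercept by default (fit_intercept=True),
# so no constant column is needed here.
linreg = LinearRegression()
linreg.fit(X, y)

# Label the coefficients with the column names for easier reading.
coefs = pd.Series(linreg.coef_, index=X.columns)
print(linreg.intercept_)
print(coefs)
```

Unlike Statsmodels, Scikit-learn stores the intercept separately in intercept_ and does not report standard errors or p-values.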

Remove the necessary variables to make sure the coefficients are the same for Scikit-learn vs Statsmodels
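
One likely reading of this step: when a model has an intercept and a dummy column for every bin of a categorical variable, the dummy columns sum to the intercept column, so the coefficients are not uniquely determined and the two libraries can report different values. Dropping one dummy per group removes that redundancy. The sketch below checks this on a small, made-up sample of TAX values; the sample and the helper rank_with_intercept are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-sample of TAX values, binned with the cut points used above.
tax = pd.Series([250, 300, 400, 280, 666, 222])
bins_tax = pd.cut(tax, [0, 270, 360, 712])

all_dummies = pd.get_dummies(bins_tax, prefix="TAX").astype(float)
reduced = pd.get_dummies(bins_tax, prefix="TAX", drop_first=True).astype(float)

def rank_with_intercept(dummies):
    """Return (rank, number of columns) of the design matrix with an intercept."""
    design = np.column_stack([np.ones(len(dummies)), dummies.to_numpy()])
    return int(np.linalg.matrix_rank(design)), design.shape[1]

# With every dummy kept, the rank is lower than the number of columns.
print(rank_with_intercept(all_dummies))
# With one dummy dropped, the design matrix has full column rank.
print(rank_with_intercept(reduced))
```

The dropped bin becomes the reference category: the remaining dummy coefficients are then read as differences relative to it.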

Statsmodels

Scikit-learn

Interpret the coefficients for PTRATIO and LSTAT

  • CRIM: per capita crime rate by town
  • INDUS: proportion of non-retail business acres per town
  • CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
  • RM: average number of rooms per dwelling
  • AGE: proportion of owner-occupied units built prior to 1940
  • DIS: weighted distances to five Boston employment centres
  • RAD: index of accessibility to radial highways
  • TAX: full-value property-tax rate per $10,000
  • PTRATIO: pupil-teacher ratio by town
  • B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
  • LSTAT: % lower status of the population
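
Because PTRATIO and LSTAT were log-transformed and then standardized, a one-unit change in the model's variable is one standard deviation of the logged values, which corresponds to multiplying the raw value by a fixed factor. The sketch below illustrates this with made-up LSTAT values; the array and the helper transform are assumptions for illustration only.

```python
import numpy as np

# Hypothetical raw LSTAT values, used only to illustrate the scaling.
lstat = np.array([4.0, 6.5, 9.1, 12.4, 17.8, 25.0])

log_lstat = np.log(lstat)
mean = np.mean(log_lstat)
sd = np.sqrt(np.var(log_lstat))  # population sd, as in the preprocessing code

def transform(raw):
    """Log-transform, then standardize with the sample mean and sd."""
    return (np.log(raw) - mean) / sd

# Multiplying the raw value by exp(sd) raises the transformed value by 1,
# which is the step a regression coefficient refers to.
factor = np.exp(sd)
raw = 10.0
step = transform(raw * factor) - transform(raw)
print(factor, step)
```

So a coefficient on such a variable is the expected change in the target when the raw predictor is multiplied by exp(sd), holding the other predictors fixed.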

Predict the house price given the following characteristics (raw values, before any transformation!)

Make sure to transform your variables as needed!

  • CRIM: 0.15
  • INDUS: 6.07
  • CHAS: 1
  • RM: 6.1
  • AGE: 33.2
  • DIS: 7.6
  • PTRATIO: 17
  • B: 383
  • LSTAT: 10.87
  • RAD: 8
  • TAX: 284
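
A new observation has to go through the same steps as the training data, using the minimum, maximum, mean and standard deviation of the original columns rather than of the new values. The sketch below shows the pattern for one min-max-scaled and one standardized variable, plus the bin lookup for RAD and TAX. The training columns and the helpers minmax_value and standardize_value are hypothetical; in the lab you would pass the raw columns of the real data.

```python
import numpy as np
import pandas as pd

def minmax_value(value, train):
    """Scale a single value with the min and max of the training column."""
    train = np.asarray(train, dtype=float)
    return (value - train.min()) / (train.max() - train.min())

def standardize_value(value, train):
    """Standardize a single value with the training mean and population sd."""
    train = np.asarray(train, dtype=float)
    return (value - np.mean(train)) / np.sqrt(np.var(train))

# Hypothetical training columns, before any transformation.
crim_raw = pd.Series([0.006, 0.03, 0.15, 1.2, 9.0])
age_raw = pd.Series([10.0, 33.2, 50.0, 80.0, 100.0])

crim_new = minmax_value(np.log(0.15), np.log(crim_raw))  # log first, then min-max
age_new = standardize_value(33.2, age_raw)               # AGE is not logged

# RAD = 8 and TAX = 284 fall into these bins, given the cut points used above.
rad_bin = pd.cut([8], [0, 6, 24])[0]
tax_bin = pd.cut([284], [0, 270, 360, 712])[0]
print(crim_new, age_new, rad_bin, tax_bin)
```

The bin lookup tells you which dummy columns to set to 1 for the new observation; all other dummies for that variable are 0.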

Summary

Congratulations! You've fitted your first multiple linear regression model on the Boston Housing Data.
