Instructions and template for final projects.
Dragica Adamovic | 2019-03-22 |
---|---|
Your name here | Completion date |
Your repository should include the following:
- Python script for your analysis
- Results figure/saved file
- Dockerfile for your experiment
- runtime-instructions in a file named RUNME.md
The aim of this project is to predict housing prices from given historical data and to quantify the relationship between the indicators. Finally, we want to test how accurate our statistical regression model is.
For this study we used the Boston housing dataset, which contains data on housing in different areas around the city of Boston. The main assumption of this parametric model is that housing prices will be influenced by the same factors in the future as in the past. We will analyse the data and determine the nature of the relationship between the target variable and the other attributes. If the correlation coefficient between the target and an attribute is high, we will create a statistical model that helps us predict housing prices.
The Boston housing dataset has been used extensively to test algorithms and for machine learning. It was originally published by Harrison, D. and Rubinfeld, D.L. (https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html). It contains 506 cases with 13 feature attributes plus the target value (MEDV), so the dataset is relatively small. The attributes are listed below:
- CRIM - per capita crime rate by town
- ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS - proportion of non-retail business acres per town.
- CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
- NOX - nitric oxides concentration (parts per 10 million)
- RM - average number of rooms per dwelling
- AGE - proportion of owner-occupied units built prior to 1940
- DIS - weighted distances to five Boston employment centres
- RAD - index of accessibility to radial highways
- TAX - full-value property-tax rate per $10,000
- PTRATIO - pupil-teacher ratio by town
- B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT - % lower status of the population
- MEDV - Median value of owner-occupied homes in $1000's
(Source: https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html)
The MEDV value, or price, is our target variable. All other attributes are feature variables from which we can potentially predict the value of a house.
Find below the pseudocode for analysis of the Boston dataset.
- First, import the dataset from sklearn.
  - 1.1 Import the loader with the command `from sklearn.datasets import load_boston`.
  - 1.2 Store the data in a variable, for instance with the command `boston = load_boston()`.
- Explore the dataset, its features, and its shape.
  - 2.1 Print the dataset keys with the command `print(boston.keys())`.
  - 2.2 Print the dataset shape with the command `print(boston.data.shape)`.
  - 2.3 Since one of the keys is `feature_names`, explore it with the command `print(boston.feature_names)`.
  - 2.4 Print the description of the dataset with the command `print(boston.DESCR)`.
  - 2.5 Print the first couple of rows to verify that the data is stored correctly:
    - 2.5.1 Import the pandas library, command `import pandas as pd`.
    - 2.5.2 Create a data frame and store it in a variable, for instance `bos = pd.DataFrame(boston.data)`.
    - 2.5.3 Print the data using the command `print(bos.head())`.
    - 2.5.4 If the column names show only the index, rename them with the command `bos.columns = boston.feature_names`.
    - 2.5.5 Print the table again with the renamed column headers, command `print(bos.head())`.
- Next, compute the summary statistics.
  - 3.1 Print the summary statistics, command `print(bos.describe())`.
- Explore the data with graphs and heat maps to see the relationships.
  - 4.1 Import the libraries needed for plotting and saving images.
  - 4.2 Plot the correlation matrix as a heat map and save the image.
  - 4.3 Plot the matrix of histograms and scatter plots of all variables with the `pairplot` command and save the image.
- Split the data into target and predictor values.
  - 5.1 The target is not part of `boston.data`, so add it to the data frame first, command `bos['PRICE'] = boston.target`.
  - 5.2 Store the predictor values in a variable X, command `X = bos.drop('PRICE', axis=1)`.
  - 5.3 Store the target value, the price, in a variable Y, command `Y = bos['PRICE']`.
- Split the data into train and test data.
  - 6.1 Split X and Y into train and test sets.
  - 6.2 Print the shapes of the resulting sets.
- Create a regression model.
  - 7.1 Import the linear regression model from sklearn and store an instance in a variable, for instance `lm`.
  - 7.2 Fit the curve, in our case a line, through the training data with `lm.fit(X_train, Y_train)`.
  - 7.3 Predict with `lm.predict(X_test)` and store the result in a variable called `Y_pred`.
  - 7.4 Plot the data, showing predicted values against actual values.
- Calculate the mean squared error (the average of the squared errors).
  - 8.1 Import the relevant libraries, if not yet imported.
  - 8.2 Compute the MSE between the predicted and actual values and store it in a variable.
  - 8.3 Print the results.
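The splitting, fitting, and evaluation steps above can be sketched as runnable Python. This is a minimal sketch, not the project's actual script. Because `load_boston` was removed in scikit-learn 1.2, it substitutes synthetic data from `make_regression` with the same shape as the Boston data (506 rows, 13 features). The column names `F0` to `F12`, the `test_size`, the noise level, and the `random_state` values are assumptions chosen only for illustration.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# ASSUMPTION: synthetic stand-in for the Boston data (506 rows, 13 features),
# since load_boston is not available in scikit-learn 1.2 and later.
features, target = make_regression(
    n_samples=506, n_features=13, noise=10.0, random_state=0
)
bos = pd.DataFrame(features, columns=[f"F{i}" for i in range(13)])
bos["PRICE"] = target  # hypothetical target column, named as in the steps above

# Step 5: separate the predictors (X) from the target (Y).
X = bos.drop("PRICE", axis=1)
Y = bos["PRICE"]

# Step 6: hold out part of the data for testing (test_size is an assumed value).
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.33, random_state=5
)
print(X_train.shape, X_test.shape)

# Step 7: fit a linear model on the training data and predict on the test data.
lm = LinearRegression()
lm.fit(X_train, Y_train)
Y_pred = lm.predict(X_test)

# Step 8: mean squared error between actual and predicted values.
mse = mean_squared_error(Y_test, Y_pred)
print(f"MSE: {mse:.2f}")
```

With the real dataset, only the block that builds `bos` would change. The remaining steps operate on whatever data frame contains the `PRICE` column.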
The first graph shows a correlation heat map, which indicates which features are highly correlated. It should be read together with the image below it, which indicates the nature of the relationship between the variables. As we can see, highly correlated variables can have either a linear or a non-linear relationship.
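A minimal sketch of how such a correlation matrix can be computed and read with pandas. The data here is synthetic, and the column names `feat_linear`, `feat_nonlinear`, and `feat_noise` are hypothetical. They only illustrate that a feature can be correlated with the target through either a linear or a non-linear relationship.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic, hypothetical features (not the Boston data).
feat_linear = rng.normal(6.0, 0.7, n)        # enters the price linearly
feat_nonlinear = rng.uniform(2.0, 35.0, n)   # enters the price through a log
feat_noise = rng.normal(0.0, 1.0, n)         # unrelated to the price
price = (
    5.0 * feat_linear
    - 8.0 * np.log(feat_nonlinear)
    + rng.normal(0.0, 1.0, n)
)

df = pd.DataFrame(
    {
        "feat_linear": feat_linear,
        "feat_nonlinear": feat_nonlinear,
        "feat_noise": feat_noise,
        "PRICE": price,
    }
)

# Pearson correlation matrix (the pandas default); this is what a heat map displays.
corr = df.corr()
print(corr.round(2))

# Rank the features by the strength of their correlation with the target.
ranking = corr["PRICE"].drop("PRICE").abs().sort_values(ascending=False)
print(ranking.round(2))
```

The Pearson coefficient measures only linear association, so the scatter plots are still needed to judge whether a strongly correlated feature follows a line or a curve.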
The predictions of the two linear regression models are shown in the two graphs below. The model on the left is a Linear Regression model and the model on the right is a Bayesian Ridge model. As we can see, the graphs are almost identical, and the RMS error values are also very close.
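A comparison of this kind can be sketched as follows. This is an illustrative sketch on synthetic data from `make_regression`, not the project's actual results. The sample size, noise level, `test_size`, and `random_state` values are assumptions, and the printed errors say nothing about the Boston data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import BayesianRidge, LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# ASSUMPTION: synthetic regression data with the same shape as the Boston data.
X, y = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=5
)

models = {
    "LinearRegression": LinearRegression(),
    "BayesianRidge": BayesianRidge(),
}

# Fit each model on the same split and record its root mean squared error.
rmse = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    rmse[name] = float(np.sqrt(mean_squared_error(y_test, y_pred)))
    print(f"{name}: RMSE = {rmse[name]:.3f}")
```

Using the same train/test split for both models keeps the comparison fair, because any difference in RMSE then comes from the models rather than from the data they were evaluated on.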
In this analysis we explored the data and created a linear regression model that allows us to predict housing prices from the features with which they are highly correlated. Finally, we evaluated our model using the mean squared error. Since the MSE is high, we need to test other regression models. We also noticed, when analysing the scatter plots in the correlation matrix, that some data have a non-linear correlation. It would therefore be interesting to analyse non-linear models as well.
References are given above in the text