Ridge and Lasso Regression - Lab

Introduction

In this lab, you'll practice your knowledge of Ridge and Lasso regression!

Objectives

In this lab you will:

  • Use Lasso and Ridge regression with scikit-learn
  • Compare and contrast Lasso, Ridge and non-regularized regression

Housing Prices Data

Let's look at yet another house pricing dataset:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv('Housing_Prices/train.csv')

Take a look at the .info() of the data:

# Your code here
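
For example, inspecting the column types and missing-value counts is a single call:

df.info()
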
  • First, split the data into X (predictor) and y (target) variables
  • Split the data into 75-25 training-test sets. Set the random_state to 10
  • Remove all columns of object type from X_train and X_test and assign them to X_train_cont and X_test_cont, respectively
# Create X and y
y = None
X = None

# Split data into training and test sets
X_train, X_test, y_train, y_test = None

# Remove "object"-type features from X
cont_features = None

# Remove "object"-type features from X_train and X_test
X_train_cont = None
X_test_cont = None
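
One possible way to fill in this scaffold is sketched below; it assumes the target column in train.csv is named 'SalePrice' (adjust the name if your file differs):

# Create X and y (assumes the target column is 'SalePrice')
y = df['SalePrice']
X = df.drop('SalePrice', axis=1)

# 75-25 train-test split, random_state=10
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=10)

# Keep only the non-"object" (continuous) features
cont_features = [col for col in X.columns if X[col].dtype in [np.int64, np.float64]]
X_train_cont = X_train.loc[:, cont_features]
X_test_cont = X_test.loc[:, cont_features]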

Let's use this data to build a first, naive linear regression model:

  • Fill the missing values in the data using the median of each column (use SimpleImputer)
  • Fit a linear regression model to this data
  • Compute the R-squared and the MSE for both the training and test sets
from sklearn.metrics import mean_squared_error, mean_squared_log_error
from sklearn.linear_model import LinearRegression
from sklearn.impute import SimpleImputer

# Impute missing values with median using SimpleImputer
impute = None
X_train_imputed = None
X_test_imputed = None

# Fit the model and print R2 and MSE for training and test sets
linreg = None

# Print R2 and MSE for training and test sets
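
A minimal sketch of these steps, reusing the variable names in the scaffold above:

# Median imputation, fit on the training set only
impute = SimpleImputer(strategy='median')
X_train_imputed = impute.fit_transform(X_train_cont)
X_test_imputed = impute.transform(X_test_cont)

# Fit a plain linear regression on the imputed continuous features
linreg = LinearRegression()
linreg.fit(X_train_imputed, y_train)

print('Training r^2:', linreg.score(X_train_imputed, y_train))
print('Test r^2:', linreg.score(X_test_imputed, y_test))
print('Training MSE:', mean_squared_error(y_train, linreg.predict(X_train_imputed)))
print('Test MSE:', mean_squared_error(y_test, linreg.predict(X_test_imputed)))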

Normalize your data

  • Normalize your data using a StandardScaler
  • Fit a linear regression model to this data
  • Compute the R-squared and the MSE for both the training and test sets
from sklearn.preprocessing import StandardScaler

# Scale the train and test data
ss = None
X_train_imputed_scaled = None
X_test_imputed_scaled = None

# Fit the model
linreg_norm = None


# Print R2 and MSE for training and test sets
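
One way these steps could look; the scaler, like the imputer, is fit on the training data only:

# Standardize using statistics learned from the training set
ss = StandardScaler()
X_train_imputed_scaled = ss.fit_transform(X_train_imputed)
X_test_imputed_scaled = ss.transform(X_test_imputed)

linreg_norm = LinearRegression()
linreg_norm.fit(X_train_imputed_scaled, y_train)

print('Training r^2:', linreg_norm.score(X_train_imputed_scaled, y_train))
print('Test r^2:', linreg_norm.score(X_test_imputed_scaled, y_test))
print('Training MSE:', mean_squared_error(y_train, linreg_norm.predict(X_train_imputed_scaled)))
print('Test MSE:', mean_squared_error(y_test, linreg_norm.predict(X_test_imputed_scaled)))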

Include categorical variables

So far, the models above haven't included the categorical variables. Let's include them now!

  • Select all columns of object type from X_train and X_test and assign them to X_train_cat and X_test_cat, respectively
  • Fill missing values in all these columns with the string 'missing'
# Create X_cat which contains only the categorical variables
features_cat = None
X_train_cat = None
X_test_cat = None

# Fill missing values with the string 'missing'
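
A possible sketch for selecting and cleaning the categorical columns:

# Select the "object"-type (categorical) columns
features_cat = [col for col in X.columns if X[col].dtype == 'object']
X_train_cat = X_train.loc[:, features_cat]
X_test_cat = X_test.loc[:, features_cat]

# Replace missing entries with the string 'missing'
X_train_cat = X_train_cat.fillna(value='missing')
X_test_cat = X_test_cat.fillna(value='missing')
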
  • One-hot encode all these categorical columns using OneHotEncoder
  • Transform the training and test DataFrames (X_train_cat) and (X_test_cat)
  • Run the given code to convert these transformed features into DataFrames
from sklearn.preprocessing import OneHotEncoder

# OneHotEncode categorical variables
ohe = None

# Transform training and test sets
X_train_ohe = None
X_test_ohe = None

# Convert these columns into a DataFrame
columns = ohe.get_feature_names(input_features=X_train_cat.columns)
cat_train_df = pd.DataFrame(X_train_ohe.todense(), columns=columns)
cat_test_df = pd.DataFrame(X_test_ohe.todense(), columns=columns)
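
A sketch of how the encoder could be set up; handle_unknown='ignore' avoids errors for categories that appear only in the test set. (Note that newer scikit-learn versions replace get_feature_names with get_feature_names_out.)

# Fit the encoder on the training categories, then transform both sets
ohe = OneHotEncoder(handle_unknown='ignore')
X_train_ohe = ohe.fit_transform(X_train_cat)
X_test_ohe = ohe.transform(X_test_cat)
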
  • Combine X_train_imputed_scaled and cat_train_df into a single DataFrame
  • Similarly, combine X_test_imputed_scaled and cat_test_df into a single DataFrame
# Your code here
X_train_all = None
X_test_all = None
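
One possible way to combine them; the scaled features are NumPy arrays, so they are wrapped in DataFrames before concatenating:

X_train_all = pd.concat([pd.DataFrame(X_train_imputed_scaled), cat_train_df], axis=1)
X_test_all = pd.concat([pd.DataFrame(X_test_imputed_scaled), cat_test_df], axis=1)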

Now build a linear regression model using all the features (X_train_all). Also, print the R-squared and the MSE for both the training and test sets.

# Your code here
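
A sketch (the name linreg_all is just an illustrative choice):

linreg_all = LinearRegression()
linreg_all.fit(X_train_all, y_train)

print('Training r^2:', linreg_all.score(X_train_all, y_train))
print('Test r^2:', linreg_all.score(X_test_all, y_test))
print('Training MSE:', mean_squared_error(y_train, linreg_all.predict(X_train_all)))
print('Test MSE:', mean_squared_error(y_test, linreg_all.predict(X_test_all)))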

Notice the severe overfitting above; our training R-squared is very high, but the test R-squared is negative! Similarly, the scale of the test MSE is orders of magnitude higher than that of the training MSE.

Ridge and Lasso regression

Use all the data (normalized features and dummy categorical variables, X_train_all) to build two models - one each for Lasso and Ridge regression. Each time, look at R-squared and MSE.

Lasso

With the default parameter (alpha = 1)

# Your code here
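
One possible approach, using a small helper function (a hypothetical convenience) so the same fit-and-report code can be reused for each model below:

from sklearn.linear_model import Lasso, Ridge

def fit_and_report(model):
    # Fit on the full preprocessed training data and print R2 and MSE for both sets
    model.fit(X_train_all, y_train)
    print('Training r^2:', model.score(X_train_all, y_train))
    print('Test r^2:', model.score(X_test_all, y_test))
    print('Training MSE:', mean_squared_error(y_train, model.predict(X_train_all)))
    print('Test MSE:', mean_squared_error(y_test, model.predict(X_test_all)))
    return model

lasso = fit_and_report(Lasso(alpha=1))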

With a higher regularization parameter (alpha = 10)

# Your code here
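
Reusing the same helper, a sketch (the name lasso_10 is illustrative):

lasso_10 = fit_and_report(Lasso(alpha=10))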

Ridge

With the default parameter (alpha = 1)

# Your code here
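
Again with the helper, a Ridge sketch:

ridge = fit_and_report(Ridge(alpha=1))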

With a higher regularization parameter (alpha = 10)

# Your code here
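
And with the stronger penalty:

ridge_10 = fit_and_report(Ridge(alpha=10))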

Compare the metrics

Write your conclusions here:


Compare the number of parameter estimates that are (very close to) 0 for Ridge and Lasso

Use 10**(-10) as the threshold for "very close to 0".

# Number of Ridge params almost zero

# Number of Lasso params almost zero

# Total number of Lasso coefficients, and the fraction that are (almost) zero
print(len(lasso.coef_))
print(sum(abs(lasso.coef_) < 10**(-10)) / len(lasso.coef_))
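
A sketch for the two exercise cells above, assuming ridge and lasso hold the fitted alpha = 1 models from earlier:

# Count coefficients whose absolute value falls below the 10**(-10) threshold
print('Ridge coefficients ~ 0:', sum(abs(ridge.coef_) < 10**(-10)))
print('Lasso coefficients ~ 0:', sum(abs(lasso.coef_) < 10**(-10)))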

Lasso was very effective here: it essentially performed variable selection, removing about 25% of the variables from your model!

Put it all together

To bring all of our work together, let's take a moment to put all of our preprocessing steps for categorical and continuous variables into one function. This function should take in our features as a DataFrame X and our target as a Series y, and return training and test DataFrames with all of our preprocessed features, along with the training and test targets.

def preprocess(X, y):
    '''Takes in features and target and implements all preprocessing steps for categorical and continuous features returning 
    train and test DataFrames with targets'''
    
    # Train-test split (75-25), set seed to 10

    
    # Remove "object"-type features and SalePrice from X


    # Impute missing values with median using SimpleImputer


    # Scale the train and test data


    # Create X_cat which contains only the categorical variables


    # Fill nans with a value indicating that that it is missing


    # OneHotEncode Categorical variables

    
    # Combine categorical and continuous features into the final dataframe
    
    return X_train_all, X_test_all, y_train, y_test
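
For reference, here is one way the skeleton could be filled in, reusing the steps from earlier in the lab (it assumes X was created earlier by dropping the SalePrice target, so only feature columns remain):

def preprocess(X, y):
    '''Takes in features and target and implements all preprocessing steps for categorical and continuous features returning 
    train and test DataFrames with targets'''

    # Train-test split (75-25), set seed to 10
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=10)

    # Keep only the continuous (non-"object") features
    cont_features = [col for col in X.columns if X[col].dtype in [np.int64, np.float64]]
    X_train_cont = X_train.loc[:, cont_features]
    X_test_cont = X_test.loc[:, cont_features]

    # Impute missing values with median using SimpleImputer
    impute = SimpleImputer(strategy='median')
    X_train_imputed = impute.fit_transform(X_train_cont)
    X_test_imputed = impute.transform(X_test_cont)

    # Scale the train and test data
    ss = StandardScaler()
    X_train_imputed_scaled = ss.fit_transform(X_train_imputed)
    X_test_imputed_scaled = ss.transform(X_test_imputed)

    # Create X_cat which contains only the categorical variables
    features_cat = [col for col in X.columns if X[col].dtype == 'object']
    X_train_cat = X_train.loc[:, features_cat]
    X_test_cat = X_test.loc[:, features_cat]

    # Fill nans with a value indicating that it is missing
    X_train_cat = X_train_cat.fillna(value='missing')
    X_test_cat = X_test_cat.fillna(value='missing')

    # OneHotEncode categorical variables (get_feature_names_out in newer scikit-learn)
    ohe = OneHotEncoder(handle_unknown='ignore')
    X_train_ohe = ohe.fit_transform(X_train_cat)
    X_test_ohe = ohe.transform(X_test_cat)
    columns = ohe.get_feature_names(input_features=X_train_cat.columns)
    cat_train_df = pd.DataFrame(X_train_ohe.todense(), columns=columns)
    cat_test_df = pd.DataFrame(X_test_ohe.todense(), columns=columns)

    # Combine categorical and continuous features into the final DataFrames
    X_train_all = pd.concat([pd.DataFrame(X_train_imputed_scaled), cat_train_df], axis=1)
    X_test_all = pd.concat([pd.DataFrame(X_test_imputed_scaled), cat_test_df], axis=1)

    return X_train_all, X_test_all, y_train, y_test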

Graph the training and test error to find optimal alpha values

Earlier, we tested two values of alpha to see how they affected our MSE and the values of our coefficients. We could continue to guess values of alpha for our Ridge or Lasso regression one at a time to see which values minimize our loss, or we can test a range of values and pick the alpha that minimizes our MSE. Here is an example of how we would do this:

X_train_all, X_test_all, y_train, y_test = preprocess(X, y)

train_mse = []
test_mse = []
alphas = []

for alpha in np.linspace(0, 200, num=50):
    lasso = Lasso(alpha=alpha)
    lasso.fit(X_train_all, y_train)
    
    train_preds = lasso.predict(X_train_all)
    train_mse.append(mean_squared_error(y_train, train_preds))
    
    test_preds = lasso.predict(X_test_all)
    test_mse.append(mean_squared_error(y_test, test_preds))
    
    alphas.append(alpha)

import matplotlib.pyplot as plt
%matplotlib inline

fig, ax = plt.subplots()
ax.plot(alphas, train_mse, label='Train')
ax.plot(alphas, test_mse, label='Test')
ax.set_xlabel('Alpha')
ax.set_ylabel('MSE')

# np.argmin() returns the index of the minimum value in a list
optimal_alpha = alphas[np.argmin(test_mse)]

# Add a vertical line where the test MSE is minimized
ax.axvline(optimal_alpha, color='black', linestyle='--')
ax.legend();

print(f'Optimal Alpha Value: {int(optimal_alpha)}')

Take a look at this graph of our training and test MSE against alpha. Try to explain to yourself why the shapes of the training and test curves are this way. Make sure to think about what alpha represents and how it relates to overfitting vs underfitting.

Summary

Well done! You now know how to build Lasso and Ridge regression models, use them for feature selection, and find an optimal value for $\alpha$.
