XGBoost Lab

Objective

In this lab, we'll install the popular XGBoost library and explore how to use this boosting model to classify wines by quality, using the Wine Quality Dataset from the UCI Machine Learning Repository.

Step 1: Install XGBoost

The XGBoost model is not currently included in scikit-learn, so we'll have to install it on our own.

Install the library using conda install py-xgboost.
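If you aren't working in a conda environment, the library can also be installed from PyPI:

pip install xgboost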

Run the cell below to import everything we'll need for this lab.

import xgboost as xgb
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

The dataset we'll be using for this lab is currently stored in the file winequality-red.csv.

In the cell below, use pandas to import the dataset into a dataframe, and inspect the head of the dataframe to ensure everything loaded correctly.

df = None
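A minimal sketch of this step, assuming the file sits alongside the notebook (some distributions of this dataset are semicolon-delimited, in which case pass sep=';' to read_csv):

df = pd.read_csv('winequality-red.csv')
df.head()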

For this lab, our target variable will be quality. That makes this a multiclass classification problem. Given the data in the columns from fixed_acidity through alcohol, we'll predict the quality of the wine.

This means that we need to store our target variable separately from the dataset, and then split the data and labels into training and testing sets that we can use for cross-validation.

In the cell below:

  • Store the quality column in the labels variable and then remove the column from our dataset.
  • Create a StandardScaler object and scale the data using the fit_transform() method.
  • Split the data into training and testing sets using the appropriate method from sklearn.
labels = None
labels_removed_df = None
scaler = None
scaled_df = None

# Calculate X_train, X_test, y_train, y_test 
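One possible sketch of this cell, assuming the target column is named quality; the 80/20 split and the random_state value are arbitrary choices:

labels = df['quality']
labels_removed_df = df.drop('quality', axis=1)

scaler = StandardScaler()
scaled_df = scaler.fit_transform(labels_removed_df)

# Hold out 20% of the rows for validation
X_train, X_test, y_train, y_test = train_test_split(scaled_df, labels, test_size=0.2, random_state=42)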

Now that we have prepared our data for modeling, we can use XGBoost to build a model that can accurately classify wine quality based on the features of the wine!

The API for xgboost is purposely written to mirror that of scikit-learn models, so fitting and evaluating the classifier should feel familiar.

clf = None
# clf.fit(None, None)
training_preds = None
val_preds = None
training_accuracy = None
val_accuracy = None

# print("Training Accuracy: {:.4}%".format(training_accuracy * 100))
# print("Validation accuracy: {:.4}%".format(val_accuracy * 100))

Tuning XGBoost

Our model had somewhat lackluster performance on the testing set compared to the training set, suggesting that it is beginning to overfit the training data. Let's tune the model to improve its performance and reduce overfitting.

For a full list of model parameters, see the XGBoost documentation.

Many of the parameters we'll be tuning are parameters we've already encountered when working with Decision Trees, Random Forests, and Gradient Boosted Trees.

Examine the tunable parameters for XGBoost, and then fill in appropriate values for the param_grid dictionary in the cell below. Put the values you want to test for each parameter inside the corresponding lists in param_grid.

NOTE: Remember, GridSearchCV finds the optimal combination of parameters through an exhaustive search over every combination in the grid. If you include too many parameters and values, the search will take forever to run! For the sake of time, we recommend trying no more than 3 values per parameter for the following steps.

param_grid = {
    "learning_rate": None,
    'max_depth': None,
    'min_child_weight': None,
    'subsample': None,
    'n_estimators': None,
}
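For example, here is one small grid that stays within the 3-values-per-parameter guideline (these values are illustrative starting points, not known optima):

param_grid = {
    'learning_rate': [0.1, 0.2],
    'max_depth': [6],
    'min_child_weight': [1, 2],
    'subsample': [0.5, 0.7],
    'n_estimators': [100],
}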

Now that we have constructed our param_grid dictionary, create a GridSearchCV object in the cell below and use it to tune our XGBoost model.

grid_clf = None
# grid_clf.fit(None, None)

# print("Grid Search found the following optimal parameters: ")
# for param_name in sorted(best_parameters.keys()):
#     print("%s: %r" % (param_name, best_parameters[param_name]))

training_preds = None
val_preds = None
training_accuracy = None
val_accuracy = None

# print("")
# print("Training Accuracy: {:.4}%".format(training_accuracy * 100))
# print("Validation accuracy: {:.4}%".format(val_accuracy * 100))

That's a big improvement! We've increased our validation accuracy by around 10%, and we've also reduced the amount of overfitting.

Conclusion

Great! We've now successfully made use of one of the most powerful boosting models in data science. We've also learned how to tune the model for better performance using the grid search methodology we learned previously. XGBoost is a powerful tool to have in your arsenal. Don't be afraid to experiment with it when modeling.
