Building Trees using scikit-learn - Lab

Introduction

Following the toy example we saw in the previous lesson, we'll now build a decision tree for a more complex dataset. This lab covers all major areas of standard machine learning practice, from data acquisition to evaluation of results. We'll continue to use the scikit-learn and pandas libraries to conduct this analysis, following the same structure we saw in the previous lesson.

Objectives

You will be able to:

  • Use pandas to prepare the data for the scikit-learn decision tree algorithm
  • Train the classifier with a training dataset and evaluate performance using different measures
  • Visualize the decision tree and interpret the visualization

UCI Banknote Authentication Data Set

In this lab, we'll work with a popular classification dataset called the "UCI Banknote Authentication Data Set". The data were extracted from images taken of genuine and forged banknotes. The notes were first digitized, followed by a numerical transformation using DSP techniques. The final set of engineered features are all continuous in nature, meaning that our features consist entirely of floats, with no strings to worry about. If you're curious about how the dataset was created, you can visit the dataset's UCI page to learn about the feature engineering in detail.

The dataset has the following attributes:

  1. Variance of Wavelet Transformed image (continuous)
  2. Skewness of Wavelet Transformed image (continuous)
  3. Curtosis of Wavelet Transformed image (continuous)
  4. Entropy of image (continuous)
  5. Class (integer) - Target/Label

Step 1: Import necessary Libraries

  • Import the necessary libraries, as we saw in the previous lesson
# Import necessary libraries

## Your code here 
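One plausible set of imports for the steps that follow is sketched below. The exact selection is an assumption; the lab does not prescribe it, and the previous lesson may have used a slightly different set.

```python
# Data handling
import numpy as np
import pandas as pd

# Splitting, modelling and evaluation (all part of scikit-learn)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, roc_curve, auc, confusion_matrix
```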

Step 2: Import Data

Now, we'll load our dataset in a DataFrame, perform some basic EDA, and generally get a feel for the data we'll be working with.

  • Read the file "data_banknote_authentication.csv" as a pandas DataFrame. Note that there is no header information in this dataset.
  • Assign the column names 'Variance', 'Skewness', 'Curtosis', 'Entropy', 'Class' to the dataset in the given order.
  • View the basic statistics and shape of the dataset.
  • Check the frequency of positive and negative examples in the target variable
# Create Dataframe

## Your code here 
# Describe the dataset

## Your code here 
# Shape of dataset

## Your code here 
# Class frequency of target variable 

## Your code here 
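The loading steps above might look like the sketch below. So that the snippet runs on its own, it reads four made-up rows from an in-memory string instead of the real file; those values are placeholders, not actual banknote measurements. In the lab, the file name "data_banknote_authentication.csv" would be passed to `pd.read_csv` instead.

```python
import io
import pandas as pd

# Placeholder rows (invented values) in the same header-less, 5-column layout.
# In the lab, replace sample_csv with "data_banknote_authentication.csv".
sample_csv = io.StringIO(
    "3.1,8.2,-2.0,-0.5,0\n"
    "4.0,7.5,-1.1,-1.2,0\n"
    "-1.5,-2.3,3.4,0.2,1\n"
    "-2.2,1.0,0.5,-0.3,1\n"
)

columns = ['Variance', 'Skewness', 'Curtosis', 'Entropy', 'Class']

# header=None tells pandas the first row is data, not column names
df = pd.read_csv(sample_csv, header=None, names=columns)

print(df.describe())               # basic statistics
print(df.shape)                    # (rows, columns)
print(df['Class'].value_counts())  # class frequencies
```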

Step 3: Create Features and Labels, Training and Test Data

Now we need to create our feature set X and labels y.

  • Create X and y by selecting the appropriate columns from the dataset
  • Create an 80/20 split on the dataset for training/testing. Use random_state=10 for reproducibility
# Create features and labels

## Your code here 
# Perform an 80/20 split

## Your code here 
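A minimal sketch of the feature/label selection and the 80/20 split follows. It uses a synthetic DataFrame with the same column names as a stand-in for the banknote data, so the class rule and the values are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the lab's column names (not the real data)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 4)),
                  columns=['Variance', 'Skewness', 'Curtosis', 'Entropy'])
df['Class'] = (df['Variance'] > 0).astype(int)  # arbitrary rule for the demo

# Features are every column except the target; the target is 'Class'
X = df.drop('Class', axis=1)
y = df['Class']

# test_size=0.2 gives the 80/20 split; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=10)

print(X_train.shape, X_test.shape)
```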

Step 4: Train the Classifier and Make Predictions

  • Create an instance of the decision tree classifier with random_state=10 for reproducibility
  • Fit the model to the training data
  • Use the trained model to make predictions with the test data
# Train a DT classifier

## Your code here 
# Make predictions for test data

## Your code here 
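Training and prediction could be sketched as below. The data is again a synthetic stand-in with the lab's column names, so the predictions themselves carry no meaning for real banknotes.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the banknote features and labels
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 4)),
                 columns=['Variance', 'Skewness', 'Curtosis', 'Entropy'])
y = (X['Variance'] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=10)

# Default settings apart from the fixed seed
clf = DecisionTreeClassifier(random_state=10)
clf.fit(X_train, y_train)

# One predicted class label per test row
y_pred = clf.predict(X_test)
print(y_pred[:10])
```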

Step 5: Check Predictive Performance

We can now use different evaluation measures to check the predictive performance of the classifier.

  • Check the accuracy and AUC, and create a confusion matrix
  • Interpret the results
# Calculate accuracy, AUC and confusion matrix

## Your code here 
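The three measures can be computed with functions from `sklearn.metrics`. In the sketch below, `y_test` and `y_pred` are short hand-written label lists standing in for the real test labels and predictions, which makes the arithmetic easy to check by hand.

```python
from sklearn.metrics import accuracy_score, roc_curve, auc, confusion_matrix

# Hand-written stand-ins for the true and predicted test labels
y_test = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

# Fraction of labels predicted correctly (6 of 8 here)
acc = accuracy_score(y_test, y_pred)

# AUC from the ROC curve built on the predicted labels
fpr, tpr, thresholds = roc_curve(y_test, y_pred)
roc_auc = auc(fpr, tpr)

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)

print("Accuracy:", acc)
print("AUC:", roc_auc)
print(cm)
```

With these labels, the matrix has 3 true negatives, 1 false positive, 1 false negative and 3 true positives, so both accuracy and AUC come out at 0.75.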

Bonus: Re-grow the Tree Using Entropy

In the above example, we used all the default settings for the decision tree classifier. The default impurity criterion in scikit-learn is Gini impurity. We can change it to entropy by passing the criterion='entropy' argument to the classifier in the training phase.

  • Repeat the above tasks for training, evaluation and visualization using the entropy measure.
  • Compare and interpret the results
## Your code here 
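A sketch of re-growing the tree with entropy is shown below, again on synthetic stand-in data. For the visualization it uses scikit-learn's `export_text`, a plain-text rendering chosen here because it needs no extra plotting dependencies; the lab may have intended a graphical tree instead.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the banknote features and labels
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 4)),
                 columns=['Variance', 'Skewness', 'Curtosis', 'Entropy'])
y = (X['Variance'] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=10)

# Same workflow as before, but splits are chosen by information gain
clf_entropy = DecisionTreeClassifier(criterion='entropy', random_state=10)
clf_entropy.fit(X_train, y_train)
y_pred_entropy = clf_entropy.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred_entropy))

# Text rendering of the fitted tree: one line per split or leaf
tree_text = export_text(clf_entropy, feature_names=list(X.columns))
print(tree_text)
```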

Level up - Optional

  • We discussed earlier that decision trees can be sensitive to outliers. Try to identify and remove or fix any possible outliers in the dataset.
  • Check the distributions of the data. Is there any room for normalization or scaling of the data? Apply these techniques and see whether the accuracy score changes.
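One possible way to approach both tasks is sketched below: the 1.5 × IQR rule for flagging outliers and `StandardScaler` for scaling. Both choices are assumptions, since the lab does not name a method, and the data is a synthetic stand-in with one deliberately planted extreme value.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in features with one planted outlier in row 0
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 4)),
                 columns=['Variance', 'Skewness', 'Curtosis', 'Entropy'])
X.loc[0, 'Variance'] = 25.0

# Flag rows where any feature lies outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1 = X.quantile(0.25)
q3 = X.quantile(0.75)
iqr = q3 - q1
outlier_mask = ((X < q1 - 1.5 * iqr) | (X > q3 + 1.5 * iqr)).any(axis=1)
X_clean = X[~outlier_mask]
print("Rows removed:", int(outlier_mask.sum()))

# Standardize each feature to zero mean and unit variance
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X_clean),
                        columns=X_clean.columns)
print(X_scaled.describe())
```

In a full workflow, the same rows would also have to be dropped from y, and the scaler would normally be fitted on the training split only. Because tree splits depend on the ordering of values within each feature rather than their magnitude, scaling alone would generally be expected to change the accuracy little, if at all.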

Summary

In this lesson, we grew a decision tree for the banknote authentication dataset, which is composed of continuous features extracted from photographic data. We looked at the different stages of the experiment, including data acquisition, training, prediction, and evaluation. We also looked at growing trees using the entropy vs. Gini impurity criteria. In the following lessons, we shall look at more such pre-training tuning techniques for ensuring an optimal classifier for learning and prediction.
