Principal Component Analysis in scikit-learn - Lab

Introduction

Now that you've seen a brief introduction to PCA, it's time to use scikit-learn to run PCA on your own.

Objectives

In this lab you will:

  • Implement PCA using the scikit-learn library
  • Determine the optimal number of components when performing PCA by observing the explained variance
  • Plot the decision boundary of classification experiments to visually inspect their performance

Iris dataset

To practice PCA, you'll take a look at the iris dataset. Run the cell below to load it.

from sklearn import datasets
import pandas as pd
 
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['Target'] = iris.get('target')
df.head()

Before performing PCA and visualizing the principal components, it's helpful to get a little more context on the data you'll be working with. Run the cell below to visualize the pairwise feature plots. Notice how well some of the feature pairs separate the target classes.

import matplotlib.pyplot as plt
%matplotlib inline

pd.plotting.scatter_matrix(df, figsize=(10,10));

  • Assign all columns in the following features list to X
  • Assign the 'Target' column to y

# Create features and target datasets
features = ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
X = None
y = None
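A minimal sketch of one possible solution, selecting from the df defined above:

# Select the feature columns and the target (one possible solution)
X = df[features]
y = df['Target']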

Standardize all the columns in X using StandardScaler.

# Import StandardScaler
from sklearn.preprocessing import StandardScaler

# Standardize the features
X = None

# Preview X
pd.DataFrame(data=X, columns=features).head()
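One way to fill this in; note that fit_transform returns a NumPy array rather than a DataFrame:

# Fit the scaler and transform the features (one possible solution)
X = StandardScaler().fit_transform(df[features])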

PCA Projection to 2-D Space

Now it's time to perform PCA! Project the original four-dimensional data into two dimensions. The new components are simply the two main dimensions of variance present in the data.

  • Initialize an instance of PCA from scikit-learn with two components
  • Fit the data to the model
  • Extract the first two principal components from the trained model

# Import PCA


# Instantiate PCA
pca = None

# Fit PCA
principalComponents = None
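A possible completion, using PCA from sklearn.decomposition:

# Import PCA (one possible solution)
from sklearn.decomposition import PCA

# Instantiate PCA with two components and fit it to the standardized data
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(X)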

To visualize the components, it will be useful to also have the target associated with each observation. As such, append the target (flower type) to the principal components in a pandas DataFrame.

# Create a new dataset from principal components 
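One way to do this; the df_pca, PC1, and PC2 names are our choice for illustration, not fixed by the lab:

# Combine the two components with the target column (one possible solution)
df_pca = pd.DataFrame(data=principalComponents, columns=['PC1', 'PC2'])
df_pca['Target'] = df['Target']
df_pca.head()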

Great, you now have the data reduced from four dimensions to two, alongside the target variable, the flower type.

Visualize Principal Components

Using the target data, we can visualize the principal components according to the class distribution.

  • Create a scatter plot of the principal components, color coding each example according to its flower type

# Principal Components scatter plot


# Your code here 
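A sketch of one possible plot, assuming the df_pca DataFrame from the earlier sketch:

# Scatter plot of the two components, color coded by flower type (one possible solution)
plt.figure(figsize=(8, 6))
for target, color in zip([0, 1, 2], ['r', 'g', 'b']):
    subset = df_pca[df_pca['Target'] == target]
    plt.scatter(subset['PC1'], subset['PC2'], c=color, label=iris.target_names[target])
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.legend();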

Explained Variance

You can see above that the three classes in the dataset are fairly well separable. As such, this compressed representation of the data is probably sufficient for the classification task at hand. Compare the variance in the overall dataset to what was captured by your two principal components.

# Calculate the variance explained by principal components
print('Variance of each component:', None)
print('\n Total Variance Explained:', None)
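The explained variance is available on the fitted PCA object; one possible solution:

# Explained variance ratios of the fitted PCA object (one possible solution)
print('Variance of each component:', pca.explained_variance_ratio_)
print('\n Total Variance Explained:', round(sum(pca.explained_variance_ratio_) * 100, 2))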

As you should see, these first two principal components account for the vast majority of the overall variance in the dataset. This indicates how much of the original information is retained in the compressed representation.

Compare Performance of a Classifier with PCA

Since the principal components explain 95% of the variance in the data, it is interesting to consider how a classifier trained on the compressed version would compare to one trained on the original dataset.

  • Run a KNeighborsClassifier to classify the Iris dataset
  • Use a train/test split of 80/20
  • For the reproducibility of results, set random_state=9 for the split
  • Time the process of splitting, training, and making predictions

# Classification - complete Iris dataset

# Your code here 
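A minimal sketch of the timed experiment, assuming the standardized X and target y from above; the clf name is our choice:

# Time a train/test split, fit, and evaluation on all four features (one possible solution)
import time
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

start = time.time()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=9)
clf = KNeighborsClassifier()
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print('Time taken: {:.4f} seconds'.format(time.time() - start))
print('Accuracy:', accuracy)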

Great, you can see that the model classifies the data with 100% accuracy in the given time. Remember that the time taken may differ based on your CPU load and the number of processes running on your machine.

Now repeat the above process for the dataset made from principal components:

  • Run a KNeighborsClassifier to classify the Iris dataset with principal components
  • Use a train/test split of 80/20
  • For the reproducibility of results, set random_state=9 for the split
  • Time the process of splitting, training, and making predictions

# Classification - reduced (PCA) Iris dataset
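The same experiment on the two principal components, reusing the imports from the previous sketch; we call the classifier model here because the decision-boundary cell below expects that name:

# Repeat the timed experiment on the 2-D principal components (one possible solution)
start = time.time()
X_train, X_test, y_train, y_test = train_test_split(principalComponents, y, test_size=0.2, random_state=9)
model = KNeighborsClassifier()
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print('Time taken: {:.4f} seconds'.format(time.time() - start))
print('Accuracy:', accuracy)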

Although some accuracy is lost in this representation of the data, the model was trained on half the number of features!

In more complex cases, PCA can even improve the accuracy of some machine learning tasks. In particular, PCA can be useful to reduce overfitting.

Visualize the Learned Decision Boundary

Run the cell below to visualize the decision boundary learned by the k-nearest neighbor classification model trained using the principal components of the data.

# Plot decision boundary using principal components 
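# Note: this assumes principalComponents and y are defined as above, and that
# 'model' is the KNeighborsClassifier trained on the two principal components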
import numpy as np 
def decision_boundary(pred_func):
    
    # Set the boundary
    x_min, x_max = principalComponents[:, 0].min() - 0.5, principalComponents[:, 0].max() + 0.5
    y_min, y_max = principalComponents[:, 1].min() - 0.5, principalComponents[:, 1].max() + 0.5
    h = 0.01
    
    # Build meshgrid
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    Z = pred_func(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    # Plot the contour
    plt.figure(figsize=(15,10))
    plt.contourf(xx, yy, Z, cmap=plt.cm.afmhot)
    plt.scatter(principalComponents[:, 0], principalComponents[:, 1], c=y, cmap=plt.cm.Spectral, marker='x')

decision_boundary(lambda x: model.predict(x))

plt.title('decision boundary');

Summary

In this lab, you applied PCA to the popular Iris dataset. You looked at the performance of a simple classifier and the impact of PCA on the accuracy of the model and the time it took to run the model. From here, you'll continue to explore PCA at more fundamental levels.
