Principal Component Analysis in scikit-learn - Lab

Introduction

Now that you've seen a brief introduction to PCA, it's time to use scikit-learn to run PCA on your own.

Objectives

In this lab you will:

  • Implement PCA using the scikit-learn library
  • Determine the optimal number of components when performing PCA by observing the explained variance
  • Plot the decision boundary of classification experiments to visually inspect their performance

Iris dataset

To practice PCA, you'll take a look at the iris dataset. Run the cell below to load it.

from sklearn import datasets
import pandas as pd
 
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['Target'] = iris.get('target')
df.head()

Before performing PCA and visualizing the principal components, it's helpful to get a little more context on the data you'll be working with. Run the cell below to visualize the pairwise feature plots. Notice how well some of the feature pairs separate the target classes.

import matplotlib.pyplot as plt
%matplotlib inline

pd.plotting.scatter_matrix(df, figsize=(10,10));

  • Assign all columns in the following features list to X
  • Assign the 'Target' column to y

# Create features and target datasets
features = ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
X = None
y = None
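A minimal sketch of one possible solution, selecting from the df defined above:

# Select the feature columns and the target (one possible solution)
X = df[features]
y = df['Target']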

Standardize all the columns in X using StandardScaler.

# Import StandardScaler
from sklearn.preprocessing import StandardScaler

# Standardize the features
X = None

# Preview X
pd.DataFrame(data=X, columns=features).head()
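One way to fill this in; note that fit_transform returns a NumPy array rather than a DataFrame:

# Fit the scaler and transform the features (one possible solution)
X = StandardScaler().fit_transform(df[features])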

PCA Projection to 2-D Space

Now it's time to perform PCA! Project the original four-dimensional data into two dimensions. The new components are simply the two main dimensions of variance present in the data.

  • Initialize an instance of PCA from scikit-learn with two components
  • Fit the data to the model
  • Extract the first two principal components from the trained model

# Import PCA


# Instantiate PCA
pca = None

# Fit PCA
principalComponents = None
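A possible completion, using PCA from sklearn.decomposition:

# Import PCA (one possible solution)
from sklearn.decomposition import PCA

# Instantiate PCA with two components and fit it to the standardized data
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(X)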

To visualize the components, it will be useful to also have the target associated with each observation. As such, append the target (flower type) to the principal components in a pandas DataFrame.

# Create a new dataset from principal components 
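One way to do this; the df_pca, PC1, and PC2 names are our choice for illustration, not fixed by the lab:

# Combine the two components with the target column (one possible solution)
df_pca = pd.DataFrame(data=principalComponents, columns=['PC1', 'PC2'])
df_pca['Target'] = df['Target']
df_pca.head()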

Great, you now have the data reduced from four dimensions to two, alongside the target variable, the flower type.

Visualize Principal Components

Using the target data, we can visualize the principal components according to the class distribution.

  • Create a scatter plot of the principal components, color coding each example according to its flower type

# Principal Components scatter plot


# Your code here 
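A sketch of one possible plot, assuming the df_pca DataFrame from the earlier sketch:

# Scatter plot of the two components, color coded by flower type (one possible solution)
plt.figure(figsize=(8, 6))
for target, color in zip([0, 1, 2], ['r', 'g', 'b']):
    subset = df_pca[df_pca['Target'] == target]
    plt.scatter(subset['PC1'], subset['PC2'], c=color, label=iris.target_names[target])
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.legend();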

Explained Variance

You can see above that the three classes in the dataset are fairly well separable. As such, this compressed representation of the data is probably sufficient for the classification task at hand. Compare the variance in the overall dataset to what was captured by your two principal components.

# Calculate the variance explained by principal components
print('Variance of each component:', None)
print('\n Total Variance Explained:', None)
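The explained variance is available on the fitted PCA object; one possible solution:

# Explained variance ratios of the fitted PCA object (one possible solution)
print('Variance of each component:', pca.explained_variance_ratio_)
print('\n Total Variance Explained:', round(sum(pca.explained_variance_ratio_) * 100, 2))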

As you should see, these first two principal components account for the vast majority of the overall variance in the dataset. This indicates how much of the original information is retained in the compressed representation.

Compare Performance of a Classifier with PCA

Since the principal components explain 95% of the variance in the data, it is interesting to consider how a classifier trained on the compressed version would compare to one trained on the original dataset.

  • Run a KNeighborsClassifier to classify the Iris dataset
  • Use a train/test split of 80/20
  • For the reproducibility of results, set random_state=9 for the split
  • Time the process of splitting, training, and making predictions

# Classification - complete Iris dataset

# Your code here 
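A minimal sketch of the timed experiment, assuming the standardized X and target y from above; the clf name is our choice:

# Time a train/test split, fit, and evaluation on all four features (one possible solution)
import time
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

start = time.time()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=9)
clf = KNeighborsClassifier()
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print('Time taken: {:.4f} seconds'.format(time.time() - start))
print('Accuracy:', accuracy)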

Great, you can see that the model classifies the data with 100% accuracy in the given time. Remember that the time taken may differ based on your CPU load and the number of processes running on your machine.

Now repeat the above process for the dataset made from principal components:

  • Run a KNeighborsClassifier to classify the Iris dataset with principal components
  • Use a train/test split of 80/20
  • For the reproducibility of results, set random_state=9 for the split
  • Time the process of splitting, training, and making predictions

# Classification - reduced (PCA) Iris dataset
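The same experiment on the two principal components, reusing the imports from the previous sketch; we call the classifier model here because the decision-boundary cell below expects that name:

# Repeat the timed experiment on the 2-D principal components (one possible solution)
start = time.time()
X_train, X_test, y_train, y_test = train_test_split(principalComponents, y, test_size=0.2, random_state=9)
model = KNeighborsClassifier()
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print('Time taken: {:.4f} seconds'.format(time.time() - start))
print('Accuracy:', accuracy)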

Although some accuracy is lost in this representation of the data, the model was trained on half the number of features!

In more complex cases, PCA can even improve the accuracy of some machine learning tasks. In particular, PCA can be useful to reduce overfitting.

Visualize the Learned Decision Boundary

Run the cell below to visualize the decision boundary learned by the k-nearest neighbor classification model trained using the principal components of the data.

# Plot decision boundary using principal components 
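# Note: this assumes principalComponents and y are defined as above, and that
# 'model' is the KNeighborsClassifier trained on the two principal components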
import numpy as np 
def decision_boundary(pred_func):
    
    # Set the boundary
    x_min, x_max = principalComponents[:, 0].min() - 0.5, principalComponents[:, 0].max() + 0.5
    y_min, y_max = principalComponents[:, 1].min() - 0.5, principalComponents[:, 1].max() + 0.5
    h = 0.01
    
    # Build meshgrid
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    Z = pred_func(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    # Plot the contour
    plt.figure(figsize=(15,10))
    plt.contourf(xx, yy, Z, cmap=plt.cm.afmhot)
    plt.scatter(principalComponents[:, 0], principalComponents[:, 1], c=y, cmap=plt.cm.Spectral, marker='x')

decision_boundary(lambda x: model.predict(x))

plt.title('decision boundary');

Summary

In this lab, you applied PCA to the popular Iris dataset. You looked at the performance of a simple classifier and the impact of PCA on the accuracy of the model and the time it took to run the model. From here, you'll continue to explore PCA at more fundamental levels.
