
Principal Component Analysis in scikit-learn - Lab

Introduction

The PCA algorithm is generally applied in dimensionality-reduction contexts, with the option of visualizing a complex, high-dimensional dataset in 2D or 3D. PCA can also greatly reduce the computational cost of other machine learning algorithms by allowing them to train on a smaller set of features (the principal components). In this lab, we shall apply PCA with scikit-learn to the popular Iris dataset, in an attempt to reduce the number of dimensions from 4 to 2 and see whether the reduced set of dimensions still preserves most of the variance of the complete dataset.

Objectives

You will be able to:

  • Perform PCA in Python with scikit-learn on the Iris dataset
  • Measure the impact of PCA on the accuracy of classification algorithms
  • Plot the decision boundary of different classification experiments to visually inspect their performance

Iris Dataset

In this lab we'll see how to use Principal Component Analysis to perform linear dimensionality reduction for the purpose of data visualization. Let's load the necessary libraries and the Iris dataset to get started.

Perform the following steps:

# Load necessary libraries


# Your code here 
sepal length sepal width petal length petal width target
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
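
A minimal sketch for this step (assuming we load the bundled Iris data from sklearn.datasets and relabel the columns and targets to match the table above):

# Load necessary libraries and rebuild the dataframe shown above
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=['sepal length', 'sepal width', 'petal length', 'petal width'])
# Map the numeric targets (0, 1, 2) to the 'Iris-...' labels used in this lab
df['target'] = ['Iris-' + name for name in iris.target_names[iris.target]]
df.head()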

So here we see a set of four input features i.e. four dimensions. Our goal for this simple analysis is to reduce this number to 2 (or 3) so that we can visualize the resulting principal components using the standard plotting techniques that we have learned so far in the course.

Standardize the Data

We have seen that PCA creates a feature subspace that maximizes the variance along the axes. As features may be measured on different scales, our first step in PCA is always to standardize the feature set. Although all features in the Iris dataset were measured on the same scale (cm), we shall still perform this step to get mean = 0 and variance = 1, as standard practice. This helps PCA and a number of other machine learning algorithms perform optimally. Visit "Importance of feature scaling" in the scikit-learn documentation to read more on this.

Let's create our feature and target datasets first.

  • Create a set of features with 'sepal length', 'sepal width', 'petal length', 'petal width'.
  • Create X and y datasets based on features and target variables
# Create features and Target dataset


# Your code here 
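
A possible sketch, assuming the dataframe df from the previous step:

# Create features and target datasets
features = ['sepal length', 'sepal width', 'petal length', 'petal width']
X = df[features]
y = df['target']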

Now we can take our feature set X and standardize it using the StandardScaler class from scikit-learn.

  • Standardize the feature set X
# Standardize the features


# Your code here 
sepal length sepal width petal length petal width
0 -0.900681 1.032057 -1.341272 -1.312977
1 -1.143017 -0.124958 -1.341272 -1.312977
2 -1.385353 0.337848 -1.398138 -1.312977
3 -1.506521 0.106445 -1.284407 -1.312977
4 -1.021849 1.263460 -1.341272 -1.312977
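
One way to produce the standardized features shown above (a sketch using scikit-learn's StandardScaler):

from sklearn.preprocessing import StandardScaler

# Scale each feature to mean = 0 and variance = 1
X_std = StandardScaler().fit_transform(X)

# Wrap the scaled array back into a dataframe to inspect the first few rows
pd.DataFrame(X_std, columns=features).head()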

PCA Projection to 2D Space

We shall now project the original 4-dimensional data into 2 dimensions. Remember, there usually isn't a particular meaning assigned to each principal component; the new components are just the two main directions of variance present in the data. To perform PCA with scikit-learn, we need to import it first and create an instance of PCA, specifying the number of principal components.

  • Initialize an instance of PCA from scikit-learn with 2 components
  • Fit the data to the model
  • Extract the first 2 principal components from the trained model
# Run the PCA algorithm


# Your code here 
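
A sketch of this step, assuming the standardized array X_std from above:

from sklearn.decomposition import PCA

# Keep only the first two principal components
pca = PCA(n_components=2)
principal_components = pca.fit_transform(X_std)
principal_components[:5]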

We can now save the results in a new dataframe and name the columns according to the first/second component.

  • Append the target (flower name) to the principal components in a pandas dataframe
# Create a new dataset from the principal components


# Your code here 
PC1 PC2 target
0 -2.264542 0.505704 Iris-setosa
1 -2.086426 -0.655405 Iris-setosa
2 -2.367950 -0.318477 Iris-setosa
3 -2.304197 -0.575368 Iris-setosa
4 -2.388777 0.674767 Iris-setosa
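
One way to build the dataframe shown above (a sketch; note that the signs of the components can legitimately differ between runs and scikit-learn versions):

# Put the two components into a dataframe and append the target labels
pca_df = pd.DataFrame(principal_components, columns=['PC1', 'PC2'])
pca_df['target'] = y.values
pca_df.head()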

Great, we now have a set of two dimensions, reduced from four, alongside our target variable, the flower name. Let's now try to visualize this dataset and see if the different flower species remain separable.

Visualize Principal Components

Using the target data, we can visualize the principal components according to the class distribution.

  • Create a scatter plot from principal components while color coding the examples
# Principal Components scatter plot


# Your code here 
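
A sketch of the scatter plot, color coding the three species:

import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
for species, color in zip(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], ['r', 'g', 'b']):
    subset = pca_df[pca_df['target'] == species]
    plt.scatter(subset['PC1'], subset['PC2'], c=color, label=species, s=40)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend()
plt.title('PCA of the Iris dataset (2 components)')
plt.show()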

Explained Variance

The explained variance tells us how much information (variance) can be attributed to each of the principal components.

We can see above that the three classes in the dataset remain well separable. Iris-virginica and Iris-versicolor could be better separated, but remember that we just cut the number of dimensions in half. This cost-performance trade-off is something data scientists often come across. To get a better idea of how much variance of the original dataset is explained by the principal components, we can use the attribute explained_variance_ratio_.

  • Check the explained variance of the two principal components using explained_variance_ratio_
# Calculate the variance explained by principal components


# Your code here 
Variance of each component: [0.72770452 0.23030523]

 Total Variance Explained: 95.8
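
The numbers above come straight from the fitted PCA object; a sketch:

# Fraction of the total variance captured by each principal component
print('Variance of each component:', pca.explained_variance_ratio_)
print('Total Variance Explained:', round(sum(pca.explained_variance_ratio_) * 100, 2))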

The first two PCs contain 95.80% of the information: the first PC contains 72.77% of the variance and the second PC contains 23.03%. The third and fourth principal components contain the rest of the variance of the dataset.

Compare Performance of a Classifier with PCA

So our principal components above explain 95% of the variance in the data. How much would this affect the accuracy of a classifier? The best way to answer this is with a simple classifier such as KNeighborsClassifier. We can try to classify this dataset in its original form vs. the principal components computed above.

  • Run a KNeighborsClassifier to classify the Iris dataset
  • Use a train/test split of 80/20
  • For reproducibility of results, set random_state=9 for the split
  • Time the process of splitting, training and making predictions
# Classification with the complete Iris dataset

# Your code here 
Accuracy: 1.0
Time Taken: 0.0017656260024523363
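
A sketch of this experiment (here the standardized features X_std are used for the complete dataset; exact timings will vary from machine to machine):

import time
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

start = time.time()
# 80/20 split with random_state=9 for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X_std, y, test_size=0.2, random_state=9)
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
predictions = knn.predict(X_test)
elapsed = time.time() - start

print('Accuracy:', accuracy_score(y_test, predictions))
print('Time Taken:', elapsed)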

Great, so we see that we are able to classify the data with 100% accuracy in the given time. Remember that the time taken may vary depending on the load on your CPU and the number of processes running on your machine.

Now let's repeat the above process for the dataset made from principal components.

  • Run a KNeighborsClassifier to classify the Iris dataset with principal components
  • Use a train/test split of 80/20
  • For reproducibility of results, set random_state=9 for the split
  • Time the process of splitting, training and making predictions
# Run the classifier on the PCA-transformed data


# Your code here 
Accuracy: 0.9666666666666667
Time Taken: 0.00035927799763157964
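
The same snippet as above works for the reduced data: simply pass the two principal components to the split (e.g. train_test_split(principal_components, y, test_size=0.2, random_state=9)) and keep everything else unchanged.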

So, going from four original dimensions to two derived dimensions, we still manage to get an accuracy of 96%. There is some loss, but in a big data context, where a dataset may have thousands of features, this trade-off is often accepted in order to simplify and speed up computation. The time taken to run the classifier is also much less than what we saw with the complete dataset.

Bonus: Visualize the Decision Boundary

Visualizing the decision boundary is a good way to develop intuition about a classifier's performance on 2- or 3-dimensional data. It helps point out examples that may not get classified correctly, and it gives insight into how a particular algorithm draws these boundaries, i.e. the learning process of the algorithm.

  • Draw the decision boundary for the classification with principal components (Optional - with complete dataset)
# Plot decision boundary using principal components 


# Your code here 
(Output: the decision boundary of the KNN classifier, plotted in the PC1/PC2 plane.)
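
One common way to draw the boundary is to fit the classifier on the two principal components, evaluate it over a mesh grid spanning the PC1/PC2 plane, and shade the predicted regions. A sketch (the integer encoding of the labels is only there so the predictions can be contour-plotted):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

# Encode the string labels as integers so the predictions can be contour-plotted
classes = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
y_numeric = pca_df['target'].map({name: i for i, name in enumerate(classes)}).values

# Fit a KNN classifier on the two principal components
knn_pca = KNeighborsClassifier()
knn_pca.fit(principal_components, y_numeric)

# Build a grid covering the PC1/PC2 plane and predict the class at every grid point
x_min, x_max = principal_components[:, 0].min() - 1, principal_components[:, 0].max() + 1
y_min, y_max = principal_components[:, 1].min() - 1, principal_components[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02), np.arange(y_min, y_max, 0.02))
Z = knn_pca.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.figure(figsize=(8, 6))
plt.contourf(xx, yy, Z, alpha=0.3, cmap='viridis')
plt.scatter(principal_components[:, 0], principal_components[:, 1],
            c=y_numeric, cmap='viridis', edgecolor='k')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('decision boundary')
plt.show()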

Level Up - Optional

  • Use the following classifiers instead of the KNN shown above to see how much PCA affects the accuracy when going from 4 to 2 dimensions (a sketch follows this list).
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn import tree
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
  • Use 3 principal components instead of two and re-run your experiment to see the impact on the accuracy.
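
A sketch of how the classifier comparison might be run, looping over both the original and PCA-reduced feature sets (MultinomialNB is left out here because it requires non-negative inputs, which standardized and PCA-transformed features are not; for the 3-component variant, refit PCA with n_components=3 and reuse the same loop):

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

classifiers = {'Logistic Regression': LogisticRegression(max_iter=1000),
               'Random Forest': RandomForestClassifier(random_state=9),
               'Gradient Boosting': GradientBoostingClassifier(random_state=9),
               'SVC': SVC()}

for label, data in [('original features', X_std), ('2 principal components', principal_components)]:
    # Same 80/20 split and random_state as in the KNN experiments above
    X_train, X_test, y_train, y_test = train_test_split(data, y, test_size=0.2, random_state=9)
    print(label)
    for clf_name, clf in classifiers.items():
        clf.fit(X_train, y_train)
        print('  {}: {:.3f}'.format(clf_name, accuracy_score(y_test, clf.predict(X_test))))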

Summary

In this lab we applied PCA to the popular Iris dataset. We looked at the performance of a simple classifier and the impact of PCA on it. Next, we shall take PCA to a more specialized domain, computer vision and image processing, and see how this technique can be used for image classification and data compression tasks.
