jonkwiatkowski / unsupervised-machine-learning

Use unsupervised machine learning models to find better ways to predict myopia (nearsightedness).


Unsupervised Machine Learning

Myopia Clusters

In this assignment, you’ll apply what you learned about unsupervised learning by fitting data to a model and using clustering algorithms to place data into groups. Then, you’ll create a visualization that shares your findings.

Background

You are on the data science team of a medical research company that’s interested in finding better ways to predict myopia, or nearsightedness. Your team has tried—and failed—to improve their classification model when training on the whole dataset. However, they believe that there might be distinct groups of patients that would be better to analyze separately. So, your supervisor has asked you to explore this possibility by using unsupervised learning.

You have been provided with raw data, so you’ll first need to process it to fit the machine learning models. You will use several clustering algorithms to explore whether the patients can be placed into distinct groups. Then, you’ll create a visualization to share your findings with your team and other key stakeholders.

Procedure

This activity is broken down into four parts:

  • Part 1: Prepare the Data

  • Part 2: Apply Dimensionality Reduction

  • Part 3: Perform a Cluster Analysis with K-means

  • Part 4: Make a Recommendation

Part 1: Prepare the Data

  1. Read myopia.csv into a Pandas DataFrame.

  2. Remove the "MYOPIC" column from the dataset.

    • Note: The target column is needed for supervised machine learning, but it will make an unsupervised model biased. After all, the target column is effectively providing clusters already!
  3. Standardize your dataset so that columns that contain larger values do not influence the outcome more than columns with smaller values.
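
The three preparation steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the helper name `prepare_features` and the columns `feature_a` and `feature_b` are hypothetical, and the small in-memory frame stands in for `myopia.csv`. It also assumes every column other than "MYOPIC" is numeric.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler


def prepare_features(df: pd.DataFrame, target: str = "MYOPIC") -> pd.DataFrame:
    """Drop the target column, then standardize the remaining columns."""
    # Hypothetical helper; assumes all remaining columns are numeric.
    features = df.drop(columns=[target])
    scaled = StandardScaler().fit_transform(features)
    return pd.DataFrame(scaled, columns=features.columns, index=features.index)


# With the real data this would start from: df = pd.read_csv("myopia.csv")
# The frame below is made-up stand-in data; only "MYOPIC" comes from the assignment.
df = pd.DataFrame(
    {
        "feature_a": [1.0, 2.0, 3.0, 4.0],
        "feature_b": [100.0, 300.0, 200.0, 400.0],
        "MYOPIC": [0, 1, 0, 1],
    }
)

X_scaled = prepare_features(df)
print(X_scaled.round(3))
```

After scaling, each column has mean 0 and unit variance, so `feature_b` no longer dominates distance calculations simply because its raw values are larger.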

Part 2: Apply Dimensionality Reduction

  1. Perform dimensionality reduction with PCA. How did the number of features change?

    • Hint: Rather than specify the number of principal components when you instantiate the PCA model, state the desired explained variance. For example, say that a dataset has 100 features. Using PCA(n_components=0.99) creates a model that will preserve approximately 99% of the explained variance, whether that means reducing the dataset to 80 principal components or 3. For this assignment, preserve 90% of the explained variance in dimensionality reduction.

  2. Further reduce the dataset dimensions with t-SNE and visually inspect the results. To do this, run t-SNE on the principal components, which are the output of the PCA transformation.

  3. Create a scatter plot of the t-SNE output. Are there distinct clusters?
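
As a rough sketch of the PCA and t-SNE steps above, the example below keeps 90% of the explained variance and then embeds the principal components in two dimensions. The data is synthetic (an assumption made so the example runs on its own), and the t-SNE settings `perplexity=30.0`, `init="random"`, and `random_state=0` are illustrative choices, not values taken from the assignment.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption): 150 rows and 12 correlated features
# generated from 3 hidden factors plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(150, 3))
mixing = rng.normal(size=(3, 12))
X = latent @ mixing + 0.05 * rng.normal(size=(150, 12))
X_scaled = StandardScaler().fit_transform(X)

# A float in (0, 1) asks PCA to keep just enough components to
# explain at least that fraction of the variance.
pca = PCA(n_components=0.90)
X_pca = pca.fit_transform(X_scaled)
print(f"features: {X_scaled.shape[1]} -> {X_pca.shape[1]}")
print(f"explained variance kept: {pca.explained_variance_ratio_.sum():.3f}")

# Run t-SNE on the principal components, not on the raw features.
tsne = TSNE(n_components=2, perplexity=30.0, init="random", random_state=0)
X_tsne = tsne.fit_transform(X_pca)
print(X_tsne.shape)
```

The two columns of `X_tsne` can then be passed to a scatter plot, for example `plt.scatter(X_tsne[:, 0], X_tsne[:, 1])` with Matplotlib, to look for visually distinct groups.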

Part 3: Perform a Cluster Analysis with K-means

Create an elbow plot to identify the best number of clusters. Make sure to do the following:

  • Use a for loop to determine the inertia for each k from 1 through 10.

  • If possible, determine where the elbow of the plot is, and at which value of k it appears.
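
The inertia loop described above might look something like the sketch below. It runs on synthetic blobs rather than the myopia data (an assumption made for self-containment), and `n_init=10` and `random_state=0` are illustrative settings rather than values from the assignment.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (an assumption): 300 points around 3 centers.
X, _ = make_blobs(n_samples=300, centers=3, n_features=4, random_state=0)

k_values = list(range(1, 11))  # k = 1 through 10
inertias = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    model.fit(X)
    # inertia_ is the sum of squared distances from each point
    # to its closest cluster center.
    inertias.append(model.inertia_)

for k, inertia in zip(k_values, inertias):
    print(f"k={k:2d}  inertia={inertia:.1f}")
```

Plotting `inertias` against `k_values` as a line chart gives the elbow plot; the elbow is the value of k after which further increases reduce the inertia only slightly.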

Analysis and Recommendations

The t-SNE plot appeared to show between two and five clusters. According to the elbow curve above, the optimal number of clusters seems to be 2 or 3. With k = 2, the clusters look more prominent to me, since those were the two groups I observed at the beginning. However, the elbow curve suggests that 3 clusters may fit even better. Both results are plotted above.

