In this module we utilize unsupervised machine learning to fit data to a model and use clustering algorithms to place data into groups.
This activity is broken into four parts:
- Part 1: Prepare the Data
To prepare the data, we remove the MYOPIC target column that would create bias for unsupervised modeling. This column would be more beneficial for supervised modeling. We then standardize the data using the StandardScaler from sklearn.
- Part 2: Apply Dimensionality Reduction
After the data is prepared, we reduce the dataset by applying the dimentionality reduction technique PCA. This assignment calls for an n-component of 90% of the explained variance.
We further reduce the dataset dimension with t-SNE and display our results on a scatter plot.
- Part 3: Perform a Cluster Analysis with K-means
To identify the best number of clusters, we create an elbow plot for the k-means values. We achieve this by creating a for loop to determine the inertia for k between 1 through 10.
Based on the plot above, we can see that the elbow is roughly around 4.
- Part 4: Make a Recommendation
Based on our findings, we can conclude that the patients could be clustered together. The point in which our elbow plot bends is at about 4. These clusters can also be seen in the scatterplot.
- Jupyter Notebook
- Conclusion
Program may fail with recent numpy version. Downgrading to numpy 1.21.4 will fix this issue. Source