This project explores different machine learning models to determine which is best suited to identifying exoplanets in NASA data.

This repository contains four Jupyter notebooks (one per model) and a copy of the source data in CSV format. To experiment with the notebooks, simply `git clone` or download this repo to a computer that has `pandas`, `scikit-learn`, and Jupyter installed.
Two types of models were chosen, based on their ability to assist in feature selection or to accurately classify values given a training set:

- Feature selection: `Decision Tree Classifier` and `Random Forest Classifier`
- Classification: `SVC` (a type of Support Vector Machine) and `K Nearest Neighbors`
- Determined feature importances by running the `Decision Tree` and `Random Forest` models. The features that showed the most promise were `koi_fpflag_co`, `koi_fpflag_nt`, `koi_fpflag_ss`, `koi_model_snr`, `koi_prad`, `koi_prad_err2`, and `koi_duration_err2`.
- Used those features to train the `K Nearest Neighbors` and `SVC` models.
- Used `GridSearchCV` to hypertune the models and boost performance.
- Tested the models (adding and removing features, running each model with and without `GridSearchCV`) to confirm that the feature selection was appropriate.
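The feature-selection step above can be sketched as follows. This is a minimal illustration on synthetic data standing in for the Kepler CSV (the real column names, such as `koi_fpflag_co`, come from that dataset); the sample sizes and random seeds are arbitrary assumptions.

```python
# Hedged sketch: rank features by Random Forest importance, as in the
# feature-selection step. Synthetic data replaces the actual Kepler CSV.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=8, n_informative=4,
                           random_state=42)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# feature_importances_ sums to 1; the highest-ranked features are the
# candidates to keep for training the classification models.
importances = pd.Series(rf.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))
```

With the real data, the same ranking is what surfaces the `koi_*` features listed above.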
Model accuracy began at ~0.60. Using only the important features identified by the Random Forest model, it improved to 0.857. This was achieved with an 80-20 train-test split and the default SVC kernel (`rbf`); the `linear` and `poly` kernel settings were resource- and time-intensive.
SVC appears to be the better choice for potential-planet classification, with especially high precision in identifying false positives.
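A minimal sketch of the SVC setup described above: an 80-20 split, the default `rbf` kernel, and `GridSearchCV` for hypertuning. The parameter grid here is a small hypothetical stand-in (the notebooks' actual grid is not reproduced in this README), and the data is synthetic.

```python
# Hedged sketch: SVC with rbf kernel, 80-20 split, tuned via GridSearchCV.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=7, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)  # the 80-20 split mentioned above

# Illustrative grid only; the real notebooks may search different values.
param_grid = {"C": [1, 5, 10], "gamma": ["scale", 0.01]}
grid = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3)
grid.fit(X_train, y_train)

acc = grid.score(X_test, y_test)
print(f"best params: {grid.best_params_}, test accuracy: {acc:.3f}")
```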
Achieved model accuracy of 0.866 with `k=17`, using the important features identified by the Random Forest model.

The K Nearest Neighbors model is much faster than SVC, which may make it a better choice for larger datasets. For this relatively small dataset, however, speed is not much of an issue. While its overall accuracy is slightly higher, its precision for the individual planet classes (as shown in the `classification_report`) is lower.
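The KNN configuration reported above (`k=17`) can be sketched as follows, again on synthetic data standing in for the selected Kepler features.

```python
# Hedged sketch: K Nearest Neighbors with the k=17 reported above.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=600, n_features=7, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2)

knn = KNeighborsClassifier(n_neighbors=17)
knn.fit(X_train, y_train)

acc = knn.score(X_test, y_test)
print(f"test accuracy with k=17: {acc:.3f}")
```

Because KNN stores the training set and defers work to prediction time, fitting is nearly instantaneous, which is the speed advantage noted above.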
- Revisit `support` in both models' `classification_report`s to determine whether sample rebalancing is needed
- Test other models
- Visualize the results of all models
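For the first item above, a quick way to inspect per-class `support` is the dict form of `classification_report`; the toy labels here are illustrative only.

```python
# Hedged sketch: read per-class support (count of true samples per class)
# from a classification report. A large imbalance between classes would
# suggest rebalancing, e.g. resampling or class weights.
from sklearn.metrics import classification_report

y_true = [0, 0, 0, 0, 1, 1]  # toy labels, not Kepler data
y_pred = [0, 0, 1, 0, 1, 1]

report = classification_report(y_true, y_pred, output_dict=True)
supports = {label: stats["support"] for label, stats in report.items()
            if label in {"0", "1"}}
print(supports)
```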