Following the Cross-Industry Standard Process for Data Mining (CRISP-DM) methodology, this project processed the data and developed multiple classification models to forecast whether a rookie player would remain in the NBA for at least five years. The models comprised Logistic Regression, Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Random Forest, AdaBoost, and XGBoost. The top-performing classifiers, Logistic Regression and XGBoost, were identified using key performance metrics, including the ROC-AUC score and the confusion matrix.
- Amy Yang
- Chanthru Vimalasri
- Yatindra Vegunta
```
├── README.md          <- README file with project details.
│
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- Including training and validation sets.
│   └── raw            <- Including 2022_train.csv and 2022_test.csv files.
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks, including the data preprocessing and the two best models.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.py           <- Makes project pip installable (pip install -e .) so src can be imported
│
└── src                <- Source code for use in this project.
    ├── __init__.py    <- Makes src a Python module
    │
    ├── data           <- Scripts to download or generate data
    │   └── sets.py
    │
    ├── features       <- Scripts to turn raw data into features for modeling
    │   └── build_features.py
    │
    └── models         <- Scripts to train models and then use trained models to make predictions
        ├── null.py
        └── performance.py
```
Note: The project organisation above is adapted from the cookiecutter data science project template.
- Feature engineering
- Imputation methods such as single imputation using the mean/median, multiple imputation, and nearest-neighbour imputation
- Imbalanced-data treatment, including oversampling, undersampling, SMOTE, and hyperparameter settings
- Model training with packages including lazypredict and scikit-learn
- Hyperparameter tuning with random search, grid search, and automated search using the Hyperopt package
- Model evaluation with the ROC-AUC score and a confusion matrix plot
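The imputation strategies above can be sketched with scikit-learn's imputers; the tiny array below is synthetic and only illustrates the API (multiple imputation, e.g. `IterativeImputer`, follows the same fit/transform pattern):

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

# Synthetic feature matrix with missing values (illustration only)
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

# Single imputation: replace NaNs with the column mean (or strategy="median")
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# Nearest-neighbour imputation: fill NaNs from the closest complete rows
X_knn = KNNImputer(n_neighbors=1).fit_transform(X)
```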
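For the imbalance treatment, a minimal sketch of naive random oversampling on randomly generated data (SMOTE, available in the `imbalanced-learn` package, instead interpolates synthetic minority samples between neighbours), alongside the class-weight alternative mentioned under hyperparameter settings:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([1] * 90 + [0] * 10)  # 9:1 class imbalance (synthetic)

# Naive random oversampling: resample the minority class with replacement
minority = np.where(y == 0)[0]
extra = rng.choice(minority, size=80, replace=True)
X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])  # now 90 vs 90

# Alternative: reweight classes inside the model instead of resampling
clf = LogisticRegression(class_weight="balanced").fit(X, y)
```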
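Random search (and, by swapping in `GridSearchCV`, grid search) runs through scikit-learn's cross-validated search; the parameter grid and data below are illustrative, not those used in the project, and Hyperopt replaces the random sampler with a tree-structured Parzen estimator while keeping the same fit/score loop:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the training set (illustration only)
X, y = make_classification(n_samples=200, random_state=0)

param_dist = {"n_estimators": [10, 50], "max_depth": [2, 4, None]}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_dist,
    n_iter=4,            # sample 4 of the 6 combinations at random
    cv=3,
    scoring="roc_auc",   # same metric used for model selection here
    random_state=0,
)
search.fit(X, y)
```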
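Evaluation with the ROC-AUC score and a confusion matrix reduces to two scikit-learn calls; the labels and predicted probabilities below are toy values, with a 0.5 threshold assumed for the hard predictions:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8])  # predicted P(class 1)

auc = roc_auc_score(y_true, y_prob)

# Threshold probabilities at 0.5 to get hard labels for the confusion matrix
y_pred = (y_prob >= 0.5).astype(int)
cm = confusion_matrix(y_true, y_pred)  # rows: true class, cols: predicted
```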
Kaggle Competition [UTS AdvDSI 2022-11] NBA Career Prediction