Early stage diabetes prediction using Machine Learning
Project Overview :
In this project I have predicted the chances of diabetes using Early stage diabetes risk prediction dataset.This has been collected using direct questionnaires from the patients of Sylhet Diabetes Hospital in Sylhet, Bangladesh and approved by a doctor. This dataset contains the sign and symptpom data of newly diabetic or would be diabetic patient.
The datasets consists of several medical predictor variables and one target variable, class. Predictor variables includes the Age, gender, Polyuria,Polydipsia and so on. The dataset used is available at UCI Machine Learning repository
Installations :
This project requires Python 3.x and the following Python libraries should be installed to get the project started:
I also reccommend to install Anaconda, a pre-packaged Python distribution that contains all of the necessary libraries and software for this project which also include jupyter notebook to run and execute IPython Notebook.
Code :
Actual code to get started with the project is provided in two files one is,Early Stage Diabetes Prediction.ipynb
Run :
In a terminal or command window, navigate to the top-level project directory PIMA_Indian_Diabetes/ (that contains this README) and run one of the following commands:
ipython notebook Early Stage Diabetes Prediction.ipynb
or
jupyter notebook Early Stage Diabetes Prediction.ipynb
This will open the Jupyter Notebook software and project file in your browser.
About Data
This dataset contains the sign and symptpom data of newly diabetic or would be diabetic patient.This has been collected using direct questionnaires from the patients of Sylhet Diabetes Hospital in Sylhet, Bangladesh and approved by a doctor.
Features of the dataset
The dataset consist of total 16 features and one target variable named class.
1. Age: Age in years ranging from (20years to 65 years)
2. Gender: Male / Female
3. Polyuria: Yes / No
4. Polydipsia: Yes/ No
5. Sudden weight loss: Yes/ No
6. Weakness: Yes/ No
7. Polyphagia: Yes/ No
8. Genital Thrush: Yes/ No
9. Visual blurring: Yes/ No
10. Itching: Yes/ No
11. Irritability: Yes/No
12. Delayed healing: Yes/ No
13. Partial Paresis: Yes/ No
14. Muscle stiffness: yes/ No
15. Alopecia: Yes/ No
16. Obesity: Yes/ No
Class: Positive / Negative
Steps to be Followed :
Following steps I have taken to apply machine learning models:
- Importing Essential Libraries.
- Data Preparation & Data Cleaning.
- Data Visualization (already done in early_Diabetes_Prediction_EDA.ipynb)
- Feature Engineering to discover essential features in the process of applying machine learning.
- Encoding Categorical Variables.
- Train Test Split
- Apply Machine Learning Algorithm
- Cross Validation
- Model Evaluation
Model Evaluation :
I have done model evaluation based on following sklearn metric.
- [Cross Validation Score] (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html)
- [Confusion Matrix] (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html)
- [Plotting ROC-AUC Curve] (https://en.wikipedia.org/wiki/Receiver_operating_characteristic)
- [Sensitivity and Specitivity] (https://en.wikipedia.org/wiki/Sensitivity_and_specificity)
- [Classification Error] (https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/)
Results :
The result of Logistic Regression and Random forest algorithms with or without feature selection is shown below.