Flow of the Project:
- Dataset obtained from Kaggle.
- Analyse and filter the data in the dataset.
- Identify N/A values, if any, and apply an imputation method when they are present.
- Perform class imbalance handling methods if required.
- Split/Partition the data into Training and Testing Sets.
- Implement the training algorithm.
- (Optional) Use Train Control for the above training algorithm.
- Perform hyper-parameter tuning.
- Implement feature selection for more optimised results.
- Predict the results from the testing set.
- Calculate the model evaluation parameters such as Accuracy, Precision, Specificity and Sensitivity.
- Present the results in tabular format.
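As an illustration of the evaluation step, the sketch below computes the four evaluation parameters from a list of actual and predicted labels. It is a minimal Python example and not the project's own code; the function names (`confusion_counts`, `evaluate`), the choice of "yes" as the positive class, and the sample labels are all assumptions made for the example.

```python
def confusion_counts(actual, predicted, positive="yes"):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def safe_div(numerator, denominator):
    """Return 0.0 instead of failing when a class is absent."""
    return numerator / denominator if denominator else 0.0


def evaluate(actual, predicted, positive="yes"):
    """Return accuracy, precision, sensitivity and specificity as a dict."""
    tp, fp, tn, fn = confusion_counts(actual, predicted, positive)
    return {
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "precision": safe_div(tp, tp + fp),
        "sensitivity": safe_div(tp, tp + fn),  # true positive rate
        "specificity": safe_div(tn, tn + fp),  # true negative rate
    }


# Made-up labels, for illustration only.
actual = ["yes", "no", "no", "yes", "no", "no"]
predicted = ["yes", "no", "yes", "no", "no", "no"]

metrics = evaluate(actual, predicted)
for name, value in metrics.items():
    print(f"{name:<12}{value:.3f}")
```

With these sample labels there is one true positive, one false positive, one false negative and three true negatives, so precision and sensitivity are both 0.5 and specificity is 0.75.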
Class imbalance handling methods used in the project are:
- Random over sampling
- Random under sampling
- ovun over sampling
- ovun under sampling
- ROSE
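The idea behind random over sampling and random under sampling can be sketched in a few lines. This is a Python illustration using only the standard library, not the project's own code; the helper names, the `label_of` callback and the toy rows are assumptions. ROSE is not shown, because it generates synthetic rows instead of resampling existing ones.

```python
import random
from collections import Counter


def group_by_class(rows, label_of):
    """Bucket rows by their class label."""
    groups = {}
    for row in rows:
        groups.setdefault(label_of(row), []).append(row)
    return groups


def random_over_sample(rows, label_of, seed=0):
    """Duplicate random minority rows until every class matches the largest."""
    rng = random.Random(seed)
    groups = group_by_class(rows, label_of)
    target = max(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(group)
        # Sampling with replacement: existing rows are repeated.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced


def random_under_sample(rows, label_of, seed=0):
    """Drop random majority rows until every class matches the smallest."""
    rng = random.Random(seed)
    groups = group_by_class(rows, label_of)
    target = min(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        # Sampling without replacement: some rows are discarded.
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced


# Toy data: 8 "no" rows and 2 "yes" rows, stored as (id, label) tuples.
rows = [(i, "no") for i in range(8)] + [(i, "yes") for i in range(8, 10)]


def label_of(row):
    return row[1]


over = random_over_sample(rows, label_of)
under = random_under_sample(rows, label_of)
print(sorted(Counter(map(label_of, over)).items()))   # [('no', 8), ('yes', 8)]
print(sorted(Counter(map(label_of, under)).items()))  # [('no', 2), ('yes', 2)]
```

Over sampling keeps every original row but repeats minority rows, while under sampling discards majority rows, so it leaves less data to train on.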
Algorithms Implemented in the project are:
- Decision Tree
- Logistic Regression
- Random Forest
- SVM
- kNN
- Naive Bayes
Dataset/Data Splitting methods used for each dataset are:
- 75% training and 25% testing
- 80% training and 20% testing
- 70% training and 30% testing
- K-Fold
- K-Fold within 75% training and 25% testing
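The splitting methods listed above can be sketched as follows. This is a Python illustration, not the project's own code; the function names and the choice of k = 5 are assumptions. It also assumes that "K-Fold within 75% training and 25% testing" means running the folds on the training part only while the 25% test part stays untouched.

```python
import random


def holdout_split(rows, train_fraction=0.75, seed=0):
    """Shuffle the rows and split them into training and testing parts."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


def k_fold(rows, k=5, seed=0):
    """Yield (train, validation) pairs; each row is validated exactly once."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        yield train, validation


data = list(range(100))  # stand-in for 100 dataset rows

# 75% training and 25% testing (use 0.8 or 0.7 for the other splits).
train, test = holdout_split(data, train_fraction=0.75)
print(len(train), len(test))  # 75 25

# K-Fold within the 75% training part; the test part is not used here.
for fold_train, fold_validation in k_fold(train, k=5):
    print(len(fold_train), len(fold_validation))  # 60 15
```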
Feature Selection methods implemented for each of the mentioned algorithms are:
- Correlation-based Feature Selection (CFS)
- Information Gain
- Random Forest based importance score
- Backward Elimination
- Least Absolute Shrinkage and Selection Operator (LASSO)
- Recursive Feature Elimination (RFE)
- Chi Square
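To show how one of these methods ranks features, the sketch below computes Information Gain for categorical features. It is a Python illustration, not the project's own code; the feature names and values are made up, and it assumes features are categorical (numeric features would need to be binned first).

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(
        len(group) / total * entropy(group) for group in groups.values()
    )
    return entropy(labels) - remainder


# Hypothetical data: 3 "yes" and 5 "no" outcomes with three toy features.
labels = ["yes", "yes", "yes", "no", "no", "no", "no", "no"]
features = {
    "informative": ["a", "a", "a", "b", "b", "b", "b", "b"],
    "partial": ["a", "a", "b", "a", "b", "b", "b", "b"],
    "uninformative": ["x"] * 8,
}

scores = {name: information_gain(values, labels) for name, values in features.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
for name in ranking:
    print(f"{name:<15}{scores[name]:.3f}")
```

A feature that separates the classes perfectly scores the full entropy of the labels, and a constant feature scores zero. Keeping only the top-ranked features is one way to apply the score.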
The over sampled and both sampled folders contain all the algorithms applied in the above project, in sub-folders named after the algorithm implemented in the files they hold.
"DetailedReport.docx" and "Detailed Report.pdf" contain all the details and results (parameters: Accuracy, Precision, Specificity and Sensitivity) obtained after applying the various algorithms (such as logistic regression, decision tree, etc.) and feature selection techniques in combination with the various data splits.
It is important to read the "Dataset Info.txt" file, as it contains the most valuable results drawn from the original dataset and from the datasets obtained after class imbalance handling.
This folder contains folders named:
- Decision Tree
- Logistic Regression
- Random Forest
- SVM
- kNN
- Naive Bayes
These folders contain files named after the feature selection technique used.
Q) What is class imbalance handling?
Ans. In simple words, class imbalance handling is a way of increasing or decreasing the number of rows so that the counts of "yes" and "no" values in the output 'Y' become nearly equal, which helps to get unbiased results.
Q) How does it give unbiased results?
Ans. If the model is trained on data where the output is "no" most of the time, its predictions for "yes" may often be inaccurate. The same happens in reverse when the model is trained on data where the output is "yes" most of the time. So, it is recommended to perform class imbalance handling in such cases to get unbiased results.
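A small worked example makes the bias visible. The Python sketch below uses hypothetical numbers (90 "no" rows and 10 "yes" rows) and a model that has simply learned to answer "no" every time. It is an illustration only, not a result from this project.

```python
# Hypothetical imbalanced outcome column: 90 "no" and 10 "yes".
actual = ["no"] * 90 + ["yes"] * 10

# A biased model that always predicts the majority class.
predicted = ["no"] * len(actual)

correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)

true_positives = sum(a == "yes" and p == "yes" for a, p in zip(actual, predicted))
sensitivity = true_positives / actual.count("yes")

print(f"accuracy:    {accuracy:.2f}")     # accuracy:    0.90
print(f"sensitivity: {sensitivity:.2f}")  # sensitivity: 0.00
```

The accuracy looks high, yet not a single "yes" case is found. This is why Sensitivity and Specificity are worth reporting next to Accuracy, and why balancing the classes before training can help.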