Employee-Retention-Detection

Flow of the Project:

Dataset obtained from Kaggle.
Analyse and filter the data from the dataset.
Identify the N/A values, if any (if N/A values are present, apply imputation methods).
Perform class imbalance handling methods if required.
Split/Partition the data into Training and Testing Sets.
Implement the training algorithm.
(Optional) Use Train Control for the above training algorithm.
Perform hyper-parameter tuning.
Implement feature selection for more optimised results.
Predict the results from the testing set.
Calculate the model evaluation parameters such as accuracy, precision, specificity and sensitivity.
Present the results in tabular format.
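
The evaluation step above can be illustrated with a minimal Python sketch. It is not taken from the project's scripts; the function name evaluation_metrics, the "yes"/"no" labels and the sample predictions are assumptions made for illustration only.

```python
def evaluation_metrics(actual, predicted, positive="yes"):
    """Return accuracy, precision, sensitivity and specificity for binary labels."""
    pairs = list(zip(actual, predicted))
    # Count the four cells of the confusion matrix.
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)

    def ratio(numerator, denominator):
        # Avoid division by zero when a class never occurs.
        return numerator / denominator if denominator else 0.0

    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "precision": ratio(tp, tp + fp),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
    }


# Made-up labels, used only to exercise the function.
actual = ["yes", "yes", "no", "no", "no", "yes", "no", "no"]
predicted = ["yes", "no", "no", "no", "yes", "yes", "no", "no"]
metrics = evaluation_metrics(actual, predicted)

# Print the results in a simple two-column table.
for name, value in metrics.items():
    print(f"{name:<12}{value:.3f}")
```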

Class imbalance handling methods used in the project are:

Random over sampling
Random under sampling
Ovun over sampling
Ovun under sampling
ROSE
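
As a rough illustration of the random over sampling and random under sampling methods listed above, the following Python sketch balances a small made-up dataset. It is a sketch under assumptions, not the project's own implementation: the function random_resample, the column name "left" and the sample rows are hypothetical, and ROSE is not covered.

```python
import random
from collections import Counter


def random_resample(rows, label_key, method="over", seed=42):
    """Balance classes by duplicating minority rows ("over") or dropping majority rows ("under")."""
    rng = random.Random(seed)
    groups = {}
    for row in rows:
        groups.setdefault(row[label_key], []).append(row)

    sizes = [len(group) for group in groups.values()]
    target = max(sizes) if method == "over" else min(sizes)

    balanced = []
    for group in groups.values():
        if len(group) < target:
            # Over sampling: keep every row and add random duplicates.
            extra = [rng.choice(group) for _ in range(target - len(group))]
            balanced.extend(group + extra)
        else:
            # Under sampling (or class already at target): keep a random subset.
            balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced


# Hypothetical data: 3 "yes" rows and 9 "no" rows.
rows = [{"id": i, "left": "yes" if i < 3 else "no"} for i in range(12)]

over = random_resample(rows, "left", method="over")
under = random_resample(rows, "left", method="under")
print(Counter(row["left"] for row in over))
print(Counter(row["left"] for row in under))
```

Over sampling keeps all the original rows at the cost of duplicates, while under sampling discards majority-class rows, so the choice depends on how much data can be spared.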

Algorithms Implemented in the project are:

Decision Tree
Logistic Regression
Random Forest
SVM
kNN
Naive Bayes

Data splitting methods used for each dataset are:

75% training and 25% testing
80% training and 20% testing
70% training and 30% testing
K-Fold
K-Fold within 75% training and 25% testing
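
The splitting schemes above can be sketched in Python as follows. This is only an illustration: the helper names holdout_split and k_fold_splits, the value k=5 and the seed are assumptions, since the number of folds used in the project is not stated here.

```python
import random


def holdout_split(rows, train_fraction=0.75, seed=42):
    """Shuffle the rows and cut them into a training and a testing partition."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


def k_fold_splits(rows, k=5, seed=42):
    """Yield (training, validation) pairs so that each row is validated exactly once."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [row for j, fold in enumerate(folds) if j != i for row in fold]
        yield training, validation


rows = list(range(100))  # stand-in for dataset rows

# 75% training and 25% testing; pass 0.8 or 0.7 for the other two splits.
train, test = holdout_split(rows, train_fraction=0.75)
print(len(train), len(test))

# K-Fold within the 75% training partition; the test partition stays untouched.
for training, validation in k_fold_splits(train, k=5):
    print(len(training), len(validation))
```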

Feature selection methods implemented for each of the mentioned algorithms are:

Correlation-based Feature Selection (CFS)
Information Gain
Random Forest based importance score
Backward Elimination
Least Absolute Shrinkage and Selection Operator (LASSO)
Recursive Feature Elimination (RFE)
Chi Square
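
To give a feel for how one of these techniques ranks features, here is a minimal Python sketch of Information Gain for categorical features. It is an assumption-laden illustration rather than the project's code: the feature names "overtime" and "department" and all values are invented.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after grouping by the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(group) / total * entropy(group) for group in groups.values())
    return entropy(labels) - remainder


# Invented example: "overtime" separates the classes perfectly, "department" barely helps.
labels = ["yes", "yes", "yes", "no", "no", "no", "no", "no"]
features = {
    "overtime": ["y", "y", "y", "n", "n", "n", "n", "n"],
    "department": ["A", "B", "A", "B", "A", "B", "A", "B"],
}

gains = {name: information_gain(values, labels) for name, values in features.items()}
# Rank features from most to least informative.
for name, gain in sorted(gains.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<12}{gain:.3f}")
```

Features with a gain close to zero carry little information about the output and are candidates for removal.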

File Structure

The over sampled and both sampled folders contain all the algorithms applied in this project, organised into sub-folders named after the algorithm implemented in the files they hold.

"DetailedReport.docx" and "Detailed Report.pdf" contain the detailed results (parameters: Accuracy, Precision, Specificity and Sensitivity) obtained after applying the various algorithms (such as logistic regression, decision tree, etc.) and feature selection techniques in combination with the various data splits.

It is important to read the "Dataset Info.txt" file, as it contains the most valuable results drawn from the original dataset and from the subsequent datasets obtained after class imbalance handling.

Scripts Folder

This folder contains the folders named:

Decision Tree
Logistic Regression
Random Forest
SVM
kNN
Naive Bayes

These folders contain files named after the feature selection technique used.

Q) What is class imbalance handling?

Ans. In simple words, class imbalance handling is a way of increasing or decreasing the number of rows so that the counts of "yes" and "no" values in the output 'Y' become nearly equal, which helps to get unbiased results.

Q) How does this give unbiased results?

Ans. If the model is trained on data where the output is "no" most of the time, its predictions for "yes" may often be inaccurate. The same happens in reverse when the output is "yes" most of the time. So, it is recommended to perform class imbalance handling in such cases to get unbiased results.
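
The bias described above can be seen in a small Python sketch. The numbers are made up for illustration (90 "no" rows and 10 "yes" rows) and the "model" is simply an extreme case that always predicts the majority class.

```python
# Hypothetical imbalanced test set: 90 "no" rows and 10 "yes" rows.
actual = ["no"] * 90 + ["yes"] * 10

# A model biased towards the majority class, predicting "no" for every row.
predicted = ["no"] * len(actual)

# Accuracy looks good because most rows really are "no".
correct = sum(1 for a, p in zip(actual, predicted) if a == p)
accuracy = correct / len(actual)

# Sensitivity shows the problem: none of the "yes" rows are found.
yes_found = sum(1 for a, p in zip(actual, predicted) if a == "yes" and p == "yes")
sensitivity = yes_found / actual.count("yes")

print(f"accuracy={accuracy:.2f} sensitivity={sensitivity:.2f}")
```

This is why accuracy alone can be misleading on imbalanced data, and why sensitivity and specificity are worth reporting alongside it.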

Repository: pranav-patel-123/employee-retention-detection