Credit_Risk_Analysis

Overview of the loan prediction risk analysis:

Credit risk is an unbalanced classification problem, as good loans outnumber risky loans. Machine learning methods using imbalanced-learn and scikit-learn libraries are applied to train and test models with unbalanced classes and determine which model is superior at predicting credit risk.

The following models were run for this analysis:

Logistic Regression - Random Oversampling
Logistic Regression - SMOTE Oversampling
Logistic Regression - Undersampling
Logistic Regression - SMOTEENN Combination Over and Undersampling
Easy Ensemble Classifier
Balanced Random Forest Classifier

The following metrics were analyzed for all six models:

Accuracy score: the fraction of observations in the testing set that the model predicted correctly (the higher the better). Accuracy = (TP + TN) / (TP + FP + FN + TN). Note that the accuracy values reported under Results match the balanced accuracy (the mean of the two per-class recalls) rather than this plain formula; for example, the Balanced Random Forest Classifier has recalls of 0.70 and 0.88 and a reported accuracy of 0.79.
Precision: how reliable a positive classification is (the higher the better). Precision = TP / (TP + FP)
Recall/sensitivity: the fraction of the observations that actually belong to the positive class that the model predicted as positive. Recall = TP / (TP + FN)
F1 Score: the harmonic mean of precision and recall. F1 Score = 2 * (precision * recall) / (precision + recall)
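To make the formulas concrete, the sketch below computes them from the four cells of a confusion matrix, treating High Risk as the positive class. The function name `metrics_from_confusion` is hypothetical; the counts are the Easy Ensemble Classifier confusion matrix reported under Results.

```python
def metrics_from_confusion(tp, fn, fp, tn):
    """Compute the metrics defined above from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # recall of the positive class
    specificity = tn / (tn + fp)     # recall of the negative class
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "balanced_accuracy": (recall + specificity) / 2,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Easy Ensemble Classifier, High Risk as the positive class.
easy_ensemble = metrics_from_confusion(tp=93, fn=8, fp=995, tn=16109)
for name, value in easy_ensemble.items():
    print(f"{name}: {value:.2f}")
```

Rounded to two decimals, this gives a precision of 0.09, a recall of 0.92 and an F1 score of 0.16 for the High Risk class, in line with the Easy Ensemble table. The plain accuracy comes out at 0.94, while the balanced accuracy is 0.93, which is the value reported for that model.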

Results:

1.) Naive Random Oversampling:

Accuracy: 0.654

	Predicted (High Risk)	Predicted (Low Risk)
Actual (High Risk)	73	28
Actual (Low Risk)	7,324	9,780

	Precision	Recall	F1 Score
High Risk	0.01	0.72	0.02
Low Risk	1.00	0.57	0.73

2.) SMOTE Oversampling

Accuracy: 0.66

	Predicted (High Risk)	Predicted (Low Risk)
Actual (High Risk)	64	37
Actual (Low Risk)	5,296	11,808

	Precision	Recall	F1 Score
High Risk	0.01	0.63	0.02
Low Risk	1.00	0.69	0.82

3.) Undersampling

Accuracy: 0.54

	Predicted (High Risk)	Predicted (Low Risk)
Actual (High Risk)	70	31
Actual (Low Risk)	10,324	6,780

	Precision	Recall	F1 Score
High Risk	0.01	0.69	0.01
Low Risk	1.00	0.40	0.56

4.) Combination (Over and Under) Sampling SMOTEENN

Accuracy: 0.64

	Predicted (High Risk)	Predicted (Low Risk)
Actual (High Risk)	73	28
Actual (Low Risk)	7,324	9,780

	Precision	Recall	F1 Score
High Risk	0.01	0.72	0.02
Low Risk	1.00	0.57	0.73

5.) Balanced Random Forest Classifier

Accuracy: 0.79

	Predicted (High Risk)	Predicted (Low Risk)
Actual (High Risk)	71	30
Actual (Low Risk)	2,030	15,074

	Precision	Recall	F1 Score
High Risk	0.03	0.70	0.06
Low Risk	1.00	0.88	0.94

6.) Easy Ensemble Classifier

Accuracy: 0.93

	Predicted (High Risk)	Predicted (Low Risk)
Actual (High Risk)	93	8
Actual (Low Risk)	995	16,109

	Precision	Recall	F1 Score
High Risk	0.09	0.92	0.16
Low Risk	1.00	0.94	0.97

Summary:

Screenshots of results below

1.) Naive Random Oversampling:

2.) SMOTE Oversampling

3.) Undersampling

4.) SMOTEENN Combination Over and Undersampling

5.) Easy Ensemble Classifier

6.) Balanced Random Forest Classifier

Analyzing the results, we can conclude the following:

1.) Accuracy: The Easy Ensemble Classifier model has the highest accuracy at 0.93, followed by the Balanced Random Forest Classifier model at 0.79. The Undersampling model has the lowest accuracy at 0.54.

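2.) Precision: All the models have an average precision of 0.99, but this average is dominated by the much larger low-risk class, whose precision is 1.00 in every model. Precision for the high-risk class is only 0.01 to 0.09, which means that most of the loans flagged as high risk are actually low risk. The Easy Ensemble Classifier has the highest high-risk precision at 0.09.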

3.) Recall: The Easy Ensemble Classifier has an average recall of 0.94, the highest of all the methods; the next best is the Balanced Random Forest Classifier with 0.88. Its recall for the high-risk class, 0.92, is also the highest.

4.) F1 Score: The Easy Ensemble Classifier has an average F1 score of 0.97, the highest of all the methods, which indicates that its averaged precision and recall are not far apart. For the high-risk class alone its F1 score is 0.16, still the highest of the six models but low because precision for that class is only 0.09.
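The averaged scores quoted above can be reproduced from the per-class values, assuming they are support-weighted means (each class weighted by its number of test observations). The sketch below does this for the Easy Ensemble Classifier confusion matrix from the Results section; the helper names `per_class_scores` and `weighted_average` are hypothetical.

```python
def per_class_scores(tp, fn, fp):
    """Precision, recall and F1 for one class treated as the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


def weighted_average(scores, supports):
    """Mean of per-class scores, weighted by each class's support."""
    return sum(s * n for s, n in zip(scores, supports)) / sum(supports)


# Easy Ensemble Classifier confusion matrix from the Results section.
high = per_class_scores(tp=93, fn=8, fp=995)      # High Risk as positive
low = per_class_scores(tp=16109, fn=995, fp=8)    # Low Risk as positive
supports = [93 + 8, 995 + 16109]                  # actual High Risk, Low Risk

averages = {
    metric: weighted_average([high[metric], low[metric]], supports)
    for metric in ("precision", "recall", "f1")
}
for metric, value in averages.items():
    print(f"{metric}: high={high[metric]:.2f} low={low[metric]:.2f} avg={value:.2f}")
```

With 101 high-risk and 17,104 low-risk test observations, the weighted averages come out at 0.99, 0.94 and 0.97, while the high-risk precision stays at 0.09. The averages therefore say little about how well the minority class is handled, so the per-class rows are the better basis for comparing models.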

In conclusion, based on the above metrics, I would recommend using the Easy Ensemble Classifier to predict credit risk.

Code

Credit Risk Ensemble

Credit Risk Resampling
