- General Info
- Technologies
- Data Exploration
- Feature Preprocessing
- Model Training and Results Evaluation
- Feature Importance
I use supervised learning models to identify customers who are likely to churn. I also analyze the top factors that influence user retention.
This project was created with:
- Jupyter Notebook: 6.0.3
- Python 3
print(f"Num of rows: {churn_df.shape[0]}")  # row count
print(f"Num of columns: {churn_df.shape[1]}")  # column count
# check data info
churn_df.info()
# check the unique values for each column
churn_df.nunique()
About 20% of customers are labeled as churned, so the classes are imbalanced.
I check the distributions of the numerical and categorical features, as well as the correlations between features.
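To see why the imbalance matters, consider a baseline that predicts "no churn" for every customer. The sketch below uses synthetic labels with an assumed 20% churn rate (not the real dataset); `churn_rate` and `majority_baseline_accuracy` are hypothetical helper names introduced only for this illustration.

```python
# Synthetic labels: 1 = churn, 0 = retained. The 20% rate mirrors the
# proportion reported above; the data itself is made up for illustration.
labels = [1] * 20 + [0] * 80


def churn_rate(y):
    """Fraction of samples labeled as churn."""
    return sum(y) / len(y)


def majority_baseline_accuracy(y):
    """Accuracy of a model that always predicts the most common class."""
    rate = churn_rate(y)
    return max(rate, 1 - rate)


print(f"Churn rate: {churn_rate(labels):.2f}")
print(f"Always-predict-majority accuracy: {majority_baseline_accuracy(labels):.2f}")
```

Because such a trivial baseline already scores about 0.80 on accuracy, accuracy alone says little about how well churners are detected, which is why precision and recall are also worth reporting.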
numerical feature distribution:
categorical feature distribution:
correlation of features:
from sklearn import model_selection

# Reserve 25% for testing; stratify on y so both sets keep the same churn ratio
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.25, stratify=y, random_state=1)
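As a rough illustration of what stratifying on `y` does in the split above, the sketch below splits each class separately so that the training and test sets keep the same churn ratio. It is a simplified stand-in written in plain Python, not scikit-learn's implementation; `stratified_split` and the synthetic labels are assumptions made for this example.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.25, seed=1):
    """Return (train_idx, test_idx), splitting every class at the same ratio."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Synthetic labels with a 20% churn rate (illustrative only).
y = [1] * 200 + [0] * 800
train_idx, test_idx = stratified_split(y)

train_rate = sum(y[i] for i in train_idx) / len(train_idx)
test_rate = sum(y[i] for i in test_idx) / len(test_idx)
print(f"train churn rate: {train_rate:.2f}, test churn rate: {test_rate:.2f}")
```

Without stratification, a random split of an imbalanced dataset can leave the test set with noticeably more or fewer churners than the training set, which makes the evaluation less reliable.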
I train logistic regression, KNN, and Random Forest models and use 5-fold cross-validation to estimate the accuracy of each one. Random Forest has the highest accuracy, 0.8628.
model_names = ['Logistic Regression', 'KNN', 'Random Forest']
model_list = [classifier_logistic, classifier_KNN, classifier_RF]
# Score each model with 5-fold cross-validation on the training set
for name, classifier in zip(model_names, model_list):
    cv_score = model_selection.cross_val_score(classifier, X_train, y_train, cv=5)
    print(cv_score)
    print('Model accuracy of ' + name + ' is ' + str(cv_score.mean()))
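The idea behind the 5-fold scores printed above can be sketched in plain Python: the training data is cut into five parts, and each part serves once as the validation set while the other four are used for fitting. The `k_fold_indices` helper is hypothetical and uses simple contiguous folds; for classifiers, scikit-learn's `cross_val_score` with an integer `cv` uses stratified folds instead, so this only shows the general mechanism.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) pairs for k contiguous, near-equal folds."""
    # The first (n_samples % k) folds get one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(10, k=5))
for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)
```

Averaging the five validation scores gives a steadier accuracy estimate than a single train/validation split, at the cost of fitting each model five times.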
I use grid search to find the optimal hyperparameters for each model.
# Possible hyperparameter options for Random Forest
# Choose the number of trees
parameters = {
    'n_estimators': [40, 60, 80]
}
Grid_RF = GridSearchCV(RandomForestClassifier(), parameters, cv=5)
Grid_RF.fit(X_train, y_train)
TP: churned customers correctly labeled as churn.
Precision (PPV, positive predictive value): tp / (tp + fp). The number of correctly predicted churn users divided by the total number of predicted churn users. High precision means low fp: few returning users were predicted as churn users.
Recall (sensitivity, hit rate, true positive rate): tp / (tp + fn). The share of actual churn users that the model predicts correctly. High recall means low fn: few churn users were predicted as returning users.
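The two formulas above can be checked with a small worked example. The labels and predictions below are made up for illustration (1 = churn, 0 = returning user), and the helper names are hypothetical rather than taken from the notebook.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, false positives, false negatives, true negatives."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        elif actual == 1 and predicted == 0:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision(tp, fp):
    """tp / (tp + fp); defined as 0.0 when nothing was predicted positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """tp / (tp + fn); defined as 0.0 when there are no actual positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Made-up example: 4 actual churners, 3 predicted churners.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(f"tp={tp} fp={fp} fn={fn} tn={tn}")
print(f"precision={precision(tp, fp):.2f} recall={recall(tp, fn):.2f}")
```

In this toy case two of the three predicted churners are real (precision about 0.67), but only two of the four real churners are caught (recall 0.50), which is the same kind of gap between precision and recall that a conservative churn model shows.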
The best model is Random Forest, with accuracy 0.86, precision 0.78, and recall 0.44.
Confusion Matrix:
ROC Curve:
Feature importance ranking by Random Forest Model:
Feature | Importance |
---|---|
Age | 0.2404 |
EstimatedSalary | 0.1466 |
CreditScore | 0.1433 |
Balance | 0.1424 |
NumOfProducts | 0.1296 |
Tenure | 0.0816 |
IsActiveMember | 0.0395 |
Geography_Germany | 0.0217 |
Gender | 0.0185 |
HasCrCard | 0.0185 |
Geography_France | 0.0095 |
Geography_Spain | 0.0085 |