
Bank Customer Churn Prediction

This project aims to predict the customers who might leave the bank based on the historical data provided.

The Work Plan 📋

Two pipelines are investigated: 1) a Short Pipeline and 2) a Long Pipeline with additional preprocessing steps. A minimal sketch of the long pipeline follows the plan below.

  1. Short Pipeline
  • Reading the dataset.

  • Splitting the data into training data (historical data) and testing data (future data).

  • Encoding the categorical features, including both the ordinal and non-ordinal ones.

  • Modelling:

    • Automatic tuning using Bayesian optimisation.
    • Cross-validation to check the generalisation error.
    • Training then evaluating the XGB classifier using different metrics.
  • Results Discussion.

  2. Long Pipeline
  • Reading the dataset.
  • Splitting the data into training data (historical data) and testing data (future data).
  • Encoding the categorical features, including both the ordinal and non-ordinal ones.
  • Imputations.
  • One-Hot Encoding.
  • Analysing the dataset (descriptive statistics, checking correlations).
  • Standardisation.
  • Modelling:
    • Automatic tuning using Bayesian optimisation.
    • Cross-validation to check the generalisation error.
    • Training then evaluating different classifiers using different metrics.
  • Results Discussion.
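
To make the plan concrete, here is a minimal sketch of the long pipeline (split, imputation, one-hot encoding, standardisation, modelling). The file name `churn.csv` and the target column `Exited` are assumptions for illustration, not necessarily what the notebook uses.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from xgboost import XGBClassifier

# Reading the dataset (file/column names are assumptions)
df = pd.read_csv("churn.csv")
X, y = df.drop(columns=["Exited"]), df["Exited"]

# Splitting into training data (historical) and testing data (future)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

num_cols = X.select_dtypes(include="number").columns
cat_cols = X.select_dtypes(exclude="number").columns

# Imputation, one-hot encoding and standardisation per column type
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), num_cols),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), cat_cols),
])

# Modelling with an XGBoost classifier (hyperparameters left at defaults here)
model = Pipeline([("prep", preprocess),
                  ("clf", XGBClassifier(eval_metric="logloss"))])
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```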

Questions & Answers 📝

  1. Which machine learning models did you consider and why?

To determine the best approach for this classification problem, I considered several classical machine learning techniques, including a) tree-based models (XGBClassifier and Random Forest); b) a distance-based model (Support Vector Machine); and c) probability-based models (Naive Bayes and Logistic Regression). These model families handle classification problems differently. Tree-based models rely on a series of conditional statements to partition the data points into subsets. Distance-based models classify data points by the dissimilarity between them, as measured by distance functions. Finally, probability-based models, like Naive Bayes, rely on conditional probability to determine the likelihood of a specific class given a set of independent features.
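
For reference, the candidate models can be instantiated as follows (default-style hyperparameters for illustration only; the notebook tunes them via Bayesian optimisation):

```python
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

candidates = {
    "XGBoost": XGBClassifier(eval_metric="logloss"),              # tree-based (boosting)
    "Random Forest": RandomForestClassifier(n_estimators=200),    # tree-based (bagging)
    "SVM": SVC(probability=True),                                 # distance-based
    "Naive Bayes": GaussianNB(),                                  # probability-based
    "Logistic Regression": LogisticRegression(max_iter=1000),     # probability-based
}
```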


  2. What are the pros and cons of using each of these models?
  • Logistic Regression

Pros

*   Simple algorithm.
*   Interpretable.
*   Less prone to over-fitting (with low-dimensional data).
*   Best option when data is linearly separable.

Cons

*   Sensitive to outliers.
*   Requires data preparation such as scaling, normalisation, and one-hot encoding, which might be computationally expensive.
*   Poor option when the data has complex relationships.
  • Random Forest

Pros

* Interpretable (relatively).
* No need to scale or normalise the data (less preprocessing effort).
* Performs well with imbalanced datasets.
* Performs well with big data.
* Outliers have less impact.
* Can be used for feature selection since it provides the feature importance.

Cons

* Compared to other models, it takes longer to train and is computationally expensive.
  • SVM

Pros

* Outliers have less impact.
* Performs well in high dimensions (a large number of features).
* Very good option with binary classification and NLP applications (e.g. text classification).

Cons

* Requires data preparation such as scaling and normalising the data.
* Compared to other models, it takes longer to train.
  • NB

Pros

* Very fast (suitable for real-time prediction).
* Performs well with big data.
* Insensitive to irrelevant features.
* Performs well in high dimensions (a large number of features).
* Good for multi-class prediction.

Cons

* The assumption of feature independence rarely holds in practice.
* Training data needs to be representative.
  • XGBoost

Pros

* No need to scale or normalise the data (less preprocessing effort).
* Can be used for feature selection since it provides the feature importance.
* Outliers have less impact.
* Performs well with large-sized datasets.
* Less prone to overfitting.
* A strong default for a wide range of classification problems.

Cons

* Too many hyperparameters (harder to tune); see the tuning sketch below.
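
As an illustration of how that tuning can be automated with Bayesian optimisation, here is a hedged sketch using scikit-optimize's BayesSearchCV; the search space is illustrative, not the one actually used in the notebook, and `X_train`/`y_train` are assumed to come from the pipeline sketch above.

```python
from skopt import BayesSearchCV
from skopt.space import Integer, Real
from xgboost import XGBClassifier

# Bayesian search over a small, illustrative XGBoost hyperparameter space
search = BayesSearchCV(
    estimator=XGBClassifier(eval_metric="logloss"),
    search_spaces={
        "n_estimators": Integer(100, 1000),
        "max_depth": Integer(2, 10),
        "learning_rate": Real(1e-3, 0.3, prior="log-uniform"),
        "subsample": Real(0.5, 1.0),
    },
    n_iter=30,       # number of Bayesian optimisation steps
    cv=5,
    scoring="f1",
    random_state=42,
)
# search.fit(X_train, y_train)
# print(search.best_params_, search.best_score_)
```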

  3. What criteria did you use to decide which model was the best?

After the initial dataset analysis, I expected a tree-based technique (specifically XGBoost) to perform best. However, it is worth evaluating the other techniques as well. All the models were evaluated using a range of metrics, including:

  • Accuracy
  • F1
  • Jaccard
  • Kappa
  • Log loss
  • Classification report (recall and precision)
  • Confusion matrix
  • Cross-validation 10-fold (Kappa and F1 scores)

Mainly, I used Kappa (Cohen's kappa coefficient). Kappa measures the statistical agreement between the predicted class and the actual class while accounting for the probability that the model classifies correctly by chance. Kappa values above 0.75 are considered a strong indicator that the classifier's performance is excellent; values between 0.4 and 0.75 are considered fair to good; values below 0.4 are considered weak. Based on these metrics and the pros mentioned in the previous question, I chose XGBoost as the best model in this situation. However, if tuning turned out to be a major obstacle, random forest would also be a very good candidate.
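
A sketch of computing the metrics listed above with scikit-learn, assuming `model`, `X_train`, `y_train`, `X_test` and `y_test` from the pipeline sketch earlier:

```python
from sklearn.metrics import (accuracy_score, f1_score, jaccard_score,
                             cohen_kappa_score, log_loss,
                             classification_report, confusion_matrix)
from sklearn.model_selection import cross_val_score

y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("F1:      ", f1_score(y_test, y_pred))
print("Jaccard: ", jaccard_score(y_test, y_pred))
print("Kappa:   ", cohen_kappa_score(y_test, y_pred))
print("Log loss:", log_loss(y_test, y_proba))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

# 10-fold cross-validation on the training data to check the generalisation error
print(cross_val_score(model, X_train, y_train, cv=10, scoring="f1").mean())
```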


  4. Would you change your solution in any way if the input dataset was much larger (e.g. 10 GB)?

For evaluating and selecting the best model, I would draw a representative sample from the large dataset so it can be explored and modelled more easily.

For training the final model, I would use progressive data loading techniques. Some of the other stages might become too computationally expensive, and I might look for alternatives. Beyond that, XGBoost and random forest can handle large datasets, and there is always room to optimise the pipeline further.
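
As a sketch, a representative sample can be drawn from a large CSV by streaming it in chunks with pandas (the file name and sampling fraction below are illustrative assumptions):

```python
import pandas as pd

sample_parts = []
# Stream the file in chunks of 500k rows instead of loading it all at once
for chunk in pd.read_csv("churn_10gb.csv", chunksize=500_000):
    # Keep ~1% of each chunk so the sample covers the entire file
    sample_parts.append(chunk.sample(frac=0.01, random_state=42))

sample = pd.concat(sample_parts, ignore_index=True)
print(sample.shape)
```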


  5. How would you approach this challenge if the input dataset could not fit in your device memory?

I would use a big data platform such as Apache Spark, which is a distributed processing system that runs computations across a cluster of multiple nodes (a minimal sketch follows the list below).

Apart from Spark, these are some other ways to handle the problem:

  • Use a computer with more memory 😀.
  • Work with a smaller sample.
  • Change the data format.
  • Stream data or use progressive loading.
  • Use a relational database.
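
A minimal PySpark sketch of this idea (the file name, the `Exited` label column, and the choice of GBTClassifier are assumptions for illustration; a real pipeline would also handle categorical features):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import GBTClassifier

spark = SparkSession.builder.appName("churn").getOrCreate()

# Spark reads and processes the file across the cluster, not in a single machine's memory
df = spark.read.csv("churn_10gb.csv", header=True, inferSchema=True)

# Assemble feature columns into a single vector (assumes numeric feature columns)
assembler = VectorAssembler(
    inputCols=[c for c in df.columns if c != "Exited"],
    outputCol="features",
)
train, test = assembler.transform(df).randomSplit([0.8, 0.2], seed=42)

# Gradient-boosted trees as a distributed stand-in for the XGBoost model
gbt = GBTClassifier(labelCol="Exited", featuresCol="features")
model = gbt.fit(train)
model.transform(test).select("Exited", "prediction").show(5)
```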

  6. What other considerations would need to be taken into account if this was real customer data?

Generally, real-world data is messy and noisy and can break trained models in unexpected ways. Therefore, the data pipeline would include more steps to catch outliers that might affect the model's performance in production. Each stage in the pipeline should account for uncertainty and ambiguous values and handle them properly, e.g. via automated data validation tooling with a confidence level. Also, sensitive customer data should be handled appropriately, with attention to privacy and security.
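
As a toy illustration of such validation tooling, a simple rule-based check in pandas might look like this (the column names and thresholds are hypothetical, not taken from the real dataset):

```python
import pandas as pd

def find_suspicious_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Flag rows that look suspicious so they can be reviewed, imputed or rejected."""
    checks = pd.DataFrame(index=df.index)
    checks["missing_values"] = df.isna().any(axis=1)
    checks["negative_balance"] = df.get("Balance", pd.Series(0, index=df.index)) < 0
    checks["implausible_age"] = ~df.get("Age", pd.Series(30, index=df.index)).between(18, 100)
    # Return the rows failing at least one check
    return df[checks.any(axis=1)]
```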
