Showcase for using H2O and R for churn prediction (inspired by ZhouFang928 examples).
In the blog post "Telco Customer Churn with R in SQL Server 2016", ZhouFang928 presented a great analysis of telco customer churn prediction. The comparison there did not include H2O, one of my favorite machine-learning libraries. This showcase presents how easy it is to use the H2O library to build high-quality predictive models.
I have used R version 3.2.3 with the following R packages:
- data.table, version 1.9.6
- h2o, version 3.10.0.6
- bit64, version 0.9-5
- pROC, version 1.8
Installation of the packages requires Rtools compatible with your R version.
1. Install packages by running
   source("install_packages.R")
2. Train and evaluate the model by running
   source("build_telco_churn_model.R")
3. After successful model building you can find the model (in H2O format) in the export folder. It can be loaded in H2O Flow for further inspection.
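Besides H2O Flow, the exported model can also be loaded back into an R session. The following is a minimal sketch, assuming the model was written with `h2o.saveModel`; the helper name `load_exported_model` is hypothetical and not part of this repository, and the actual file name inside export depends on the model id.

```r
# Hypothetical helper (not part of this repository): connects to a local
# H2O cluster and loads a binary model written by h2o.saveModel().
# `model_path` must be the full path to the saved model.
load_exported_model <- function(model_path) {
  h2o::h2o.init()                 # start or connect to a local H2O cluster
  h2o::h2o.loadModel(model_path)  # returns the model object
}

# List candidate model files in the export folder.
# This is an empty character vector if nothing has been exported yet.
exported <- list.files("export", full.names = TRUE)
```

Calling `load_exported_model(exported[1])` would then return the model, provided the H2O version used for loading matches the one used for saving, since binary H2O models are tied to the version that wrote them.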
I decided to go with gradient boosting models (GBM). To select the best model I ran a grid search over the following parameters:
- number of trees: 50, 100, 500
- max tree depth: 4, 8, 16, 32
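The grid above can be sketched with the `h2o.grid` function. This is only an illustration under assumptions, not the code from find_best_model.R: the helper name `run_gbm_grid`, the grid id, the seed, and the `x`/`y` arguments are made up here, and the 5-fold cross-validation setting is taken from the timing note in this README.

```r
# Hyper-parameter values from the list above.
hyper_params <- list(
  ntrees    = c(50, 100, 500),
  max_depth = c(4, 8, 16, 32)
)

# Hypothetical helper (not part of this repository): trains GBM models over
# the grid and returns the grid sorted by cross-validated AUC.
# `train` is an H2OFrame, `x` the predictor column names, `y` the label column.
run_gbm_grid <- function(train, x, y, grid_id = "gbm_churn_grid") {
  h2o::h2o.grid(
    algorithm      = "gbm",
    grid_id        = grid_id,
    x              = x,
    y              = y,
    training_frame = train,
    nfolds         = 5,   # 5-fold cross-validation
    seed           = 1,   # assumed seed, for reproducibility only
    hyper_params   = hyper_params
  )
  # Sort so that the model with the highest AUC comes first.
  h2o::h2o.getGrid(grid_id, sort_by = "auc", decreasing = TRUE)
}
```

With an H2O cluster running, `grid <- run_gbm_grid(train, predictors, "churn")` followed by `h2o.getModel(grid@model_ids[[1]])` would retrieve the top model. The column name `churn` is a guess and should be replaced with the label column of the actual dataset.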
The best model was selected by AUC, resulting in 100 trees with a maximum depth of 16. After model building I optimized the classification threshold to maximize the minimum per-class accuracy.
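To make the threshold criterion concrete, here is a minimal base-R sketch of it. It is not the implementation used in find_best_model.R: the function name `best_min_per_class_threshold` is hypothetical and the toy scores and labels are invented for illustration.

```r
# Returns the threshold that maximizes the minimum per-class accuracy,
# i.e. min(true positive rate, true negative rate).
# `scores` are predicted churn probabilities, `labels` are 0/1 outcomes.
best_min_per_class_threshold <- function(scores, labels) {
  candidates <- sort(unique(scores))
  min_acc <- vapply(candidates, function(t) {
    predicted <- as.integer(scores >= t)
    tpr <- mean(predicted[labels == 1] == 1)  # accuracy on the churn class
    tnr <- mean(predicted[labels == 0] == 0)  # accuracy on the non-churn class
    min(tpr, tnr)
  }, numeric(1))
  # On ties, the lowest threshold wins because which.max returns the first maximum.
  list(threshold = candidates[which.max(min_acc)],
       min_per_class_accuracy = max(min_acc))
}

# Toy data for illustration only (not taken from the telco dataset).
scores <- c(0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.80, 0.90)
labels <- c(0,    0,    0,    1,    0,    0,    1,    1)
result <- best_min_per_class_threshold(scores, labels)
```

Maximizing the weaker of the two per-class accuracies keeps the model from favoring the majority class, which matters when churners are a small share of the customers.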
The best model (with the threshold selected to maximize the minimum per-class accuracy) gave the following results on the test dataset:
- AUC = 0.947
- Accuracy = 0.866
- Precision = 0.395
- Recall = 0.875
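For reference, accuracy, precision and recall follow from the confusion matrix counts as in the sketch below. The helper name `classification_metrics` is hypothetical and the counts are invented for illustration; they are not the confusion matrix of the model above.

```r
# Computes accuracy, precision and recall from confusion matrix counts.
# tp/fp/tn/fn = true positives, false positives, true negatives, false negatives,
# with "positive" meaning a customer who churned.
classification_metrics <- function(tp, fp, tn, fn) {
  c(accuracy  = (tp + tn) / (tp + fp + tn + fn),
    precision = tp / (tp + fp),
    recall    = tp / (tp + fn))
}

# Invented counts for an imbalanced problem (8 churners among 100 customers).
m <- classification_metrics(tp = 7, fp = 10, tn = 82, fn = 1)
```

With few positives, even a modest number of false positives pulls precision down while recall stays high, which is the pattern a threshold tuned for balanced per-class accuracy tends to produce on imbalanced data.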
Computation involved validating (using 5-fold cross-validation) 6 GBM models with different parameters. On my laptop (Intel i7, 8 GB RAM, Windows 10) it took around 25 minutes. On Amazon's EC2 c4.4xlarge instance the time dropped to around 14-15 minutes.
- Always install packages for each project separately.
- Select the best model with some parameter tuning procedure.
- Do not forget to optimize the threshold.
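The first point can be sketched in base R as follows. This is an assumption about the general approach, not the contents of install_packages.R; the helper names `missing_packages` and `install_locally` are hypothetical.

```r
# Project-local library folder, named as in this repository.
lib_dir <- "libs"

# Hypothetical helper: which of `pkgs` are not yet installed in `lib`?
missing_packages <- function(pkgs, lib) {
  installed <- rownames(installed.packages(lib.loc = lib))
  setdiff(pkgs, installed)
}

# Hypothetical helper: installs missing packages into the local folder and
# puts that folder first on the library search path.
install_locally <- function(pkgs, lib = lib_dir) {
  dir.create(lib, showWarnings = FALSE)
  .libPaths(c(lib, .libPaths()))
  to_install <- missing_packages(pkgs, lib)
  if (length(to_install) > 0) {
    install.packages(to_install, lib = lib)
  }
  invisible(to_install)
}
```

Calling `install_locally(c("data.table", "h2o", "bit64", "pROC"))` would install the latest available versions, so pinning the exact versions listed in this README would need an extra step that is not shown here.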
Folders:
- data - this folder contains a CSV file with customers' info. It is a copy of the data from ZhouFang928's example.
- libs - this folder contains packages installed by install_packages.R
- export - this folder is for saving computation results (currently the final model is stored there)
Files:
- install_packages.R - R script that installs packages into the local libs folder
- build_telco_churn_model.R - R script that trains and evaluates the model
- find_best_model.R - utility function that does grid search and returns best model with the optimal threshold.