This project is a tutorial on how to train a churn prediction model entirely with Spark. It contains a prototyping step within a notebook file, and it shows how to ship your research code as a Spark processing script that can be run in production.
It uses a music streaming dataset based on a fictional company called Sparkify, which contains all kinds of events created by users interacting with the platform.
The training is done entirely with Spark, from the cleaning steps through training and evaluation.
The dependencies are managed with Poetry. Make sure to have it installed on your system.
Also, the code was tested with:
- Ubuntu 20.04
- Python 3.8
From the root directory, run:

```shell
poetry install
```
To train the models, we used a dataset provided by Udacity with user activity from a fictional music streaming company called Sparkify. It consists of user events within the platform such as Login, NextSong, Error, etc.
We considered churned users to be those who left the platform by generating an event with `page = Cancellation Confirmation`.
You can download the mini Sparkify dataset from here.
The file should be placed under the root directory.
The prototyping was done within the `Sparkify.ipynb` notebook. The same code was shipped in the `process.py` Python file, which can be run and automated with Spark.
Both files contain the following steps:
- Data cleaning
- Feature engineering
- Model training & testing
The EDA component is only within the `Sparkify.ipynb` notebook. You can look at the EDA in an exported PDF at `Sparkify.pdf`.
Using `publish.py`, we automatically pushed the notebook to Medium (you can see how to set up `jupyter_to_medium` here).
Run the notebook with:

```shell
jupyter notebook .
```
Run the Spark Python script with:

```shell
export PYSPARK_PYTHON=$(which python)
spark-submit --master local[*] process.py
```
The results show the F1 score on the validation and test splits. The validation score was computed with 3-fold cross-validation, and the test split represents 20% of the initial data.
| Model | Validation | Test |
|---|---|---|
| Logistic Regression | 0.6958 | 0.5952 |
| Naive Bayes | 0.6672 | 0.5952 |
| Gradient Boosting | 0.7333 | 0.8473 |
The GBT model performed better than Logistic Regression and Naive Bayes, probably because it is a more complex model that can capture non-linear relationships.
Note: You can read a detailed examination of the results on Medium.
The code is licensed under the MIT license. I encourage anybody to use and share it as long as you give credit to the original author. I also want to thank Udacity for making the data available; without their assistance, I would not have been able to train the models.
If anybody has machine learning questions, suggestions, or wants to collaborate with me, feel free to contact me at [email protected] or connect with me on LinkedIn.