
pedropatin / distributed-churn-prediction


This project forked from iusztinpaul/distributed-churn-prediction


End-to-end customer churn prediction pipeline using Spark.

License: MIT License

Python 1.62% Jupyter Notebook 98.38%


Churn Prediction Using Spark

Table of Contents

  1. Motivation
  2. Installation
  3. Data
  4. Usage
  5. Results
  6. Licensing, Authors, Acknowledgements

1. Motivation

This project is a tutorial on how to train a churn prediction model entirely with Spark. It contains a prototyping step in the notebook file, and it shows how to turn that research code into a Spark processing script that can be deployed to production.

It uses a music streaming dataset from a fictional company called Sparkify. The dataset contains the events generated by users as they interacted with the platform.

Every stage runs in Spark, from data cleaning through training and evaluation.

[Figure: Listened Songs Distribution]

2. Installation

Install Dependencies

The dependencies are managed with Poetry, so make sure it is installed on your system. The code was tested with:

  • Ubuntu 20.04
  • Python 3.8


From the root directory run:

poetry install

3. Data

To train the models, we used a dataset provided by Udacity with user activity from a fictional music streaming company called Sparkify. It consists of user events within the platform, such as Login, NextSong, and Error.

We labeled a user as churned if they left the platform by generating an event with page=Cancellation Confirmation.
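The labeling rule above can be illustrated with a minimal sketch. It is written in plain Python rather than Spark so that it runs without a Spark installation, and it is not the project's actual implementation. The `page` field and its value come from the description above; the `userId` field name and the sample events are assumptions made for illustration.

```python
# Minimal sketch of the churn labeling rule (plain Python, not Spark).
# Assumption: each event is a dict with "userId" and "page" keys.
CHURN_PAGE = "Cancellation Confirmation"


def label_churn(events):
    """Map each user id to 1 if the user ever reached the churn page, else 0."""
    labels = {}
    for event in events:
        user = event["userId"]
        churned = int(event["page"] == CHURN_PAGE)
        # Once a user is marked as churned, later events cannot reset the label.
        labels[user] = max(labels.get(user, 0), churned)
    return labels


# Hypothetical sample events, invented for this example.
events = [
    {"userId": "u1", "page": "NextSong"},
    {"userId": "u1", "page": "Cancellation Confirmation"},
    {"userId": "u2", "page": "Login"},
    {"userId": "u2", "page": "Error"},
]

print(label_churn(events))  # {'u1': 1, 'u2': 0}
```

The label is per user, not per event: a single cancellation event marks all of that user's activity as belonging to a churned user.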


You can download the mini Sparkify dataset from here.


The file should be placed under the root directory.

[Figure: Churn Distribution]

4. Usage

Files Structure

The prototyping was done in the Sparkify.ipynb notebook. The same code was ported to the process.py Python file, which can be run and automated with Spark.

Both files contain the following steps:

  • Data cleaning
  • Feature engineering
  • Model training & testing

The EDA component is only within the Sparkify.ipynb Notebook. You can look at the EDA within an exported PDF at Sparkify.pdf.

Using publish.py we automatically pushed the Notebook to Medium (you can see how to set up jupyter_to_medium here).

Run

Run Notebook

Run the notebook with:

jupyter notebook .

Run Spark Script

Run the Spark Python script with:

export PYSPARK_PYTHON=`which python`
# "localhost" is not a valid Spark master URL; local[*] runs Spark locally on all cores.
spark-submit --master "local[*]" process.py

5. Results

The results show the F1 score on the validation and test splits. The validation score was computed with 3-fold cross-validation. The test split is 20% of the initial data.
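The splitting scheme described above can be sketched as follows. This is a plain-Python illustration under assumptions, not the project's code: the source does not say whether the split was made per user or per event, nor whether it was stratified, and the helper names `train_test_split` and `k_folds` are hypothetical. In Spark, the usual counterparts are `DataFrame.randomSplit` and `pyspark.ml.tuning.CrossValidator` with `numFolds=3`.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle the rows and hold out `test_fraction` of them as the test split."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def k_folds(rows, k=3):
    """Yield (train, validation) pairs; each row is used for validation exactly once."""
    for i in range(k):
        validation = rows[i::k]
        train = [row for j, row in enumerate(rows) if j % k != i]
        yield train, validation


rows = list(range(100))            # stand-in for 100 labeled examples
train, test = train_test_split(rows)
folds = list(k_folds(train, k=3))

print(len(train), len(test))       # 80 20
print([len(v) for _, v in folds])  # [27, 27, 26]
```

The test split is set aside before cross-validation, so the test score is measured on data that played no part in model selection.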

Model                 Validation F1   Test F1
Logistic Regression   0.6958          0.5952
Naive Bayes           0.6672          0.5952
Gradient Boosting     0.7333          0.8473

The Gradient Boosting (GBT) model outperformed both Logistic Regression and Naive Bayes on the validation and test splits, probably because it is a more expressive model that captures non-linear relationships better.
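For readers unfamiliar with the metric, here is a minimal sketch of how an F1 score is computed from predictions. It computes the binary F1 for the positive (churned) class. The source does not state which F1 variant was reported (for example, a class-weighted average), so treat this as an illustration of the metric rather than a reproduction of the numbers above. The labels and predictions are invented for the example.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1: the harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both 0 (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels (1 = churned) and model predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(f1_score(y_true, y_pred))  # 0.75
```

F1 is a common choice for churn problems because churned users are usually a minority, and plain accuracy can look high for a model that never predicts churn.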

Note: You can read a detailed examination of the results on Medium.

[Figure: Listened Songs Distribution]

6. Licensing, Authors, Acknowledgements

The code is licensed under the MIT license. I encourage anybody to use and share the code, as long as credit is given to the original author. I want to thank Udacity for making the data available. Without it, I would not have been able to train the models.

If anybody has machine learning questions, suggestions, or wants to collaborate with me, feel free to contact me at [email protected] or to connect with me on LinkedIn.
