krisbitney / identifying_customer_segments

Clustering and PCA (unsupervised learning) project. The goal is to identify customer segments by clustering customers in a reduced feature space.

Identifying Customer Segments

An educational unsupervised learning project. The goal is to identify customer segments by clustering customers in a reduced feature space. After producing a cluster model, I compared the population of customers to the general German population to determine how segments of the general population are represented in the customer population.

For readability, I split the project into two Jupyter notebooks.

Step 1: Explore and preprocess the data

In the "Identify_Customer_Segments_Preprocessing.ipynb" Jupyter notebook, I completed the following tasks:

  1. Explored data
  2. Assessed missing values
    • Converted missing values to NaN
    • Assessed missing values by dataframe column and dropped outliers
    • Assessed missing values by dataframe row and dropped outliers
  3. Selected, re-encoded, and engineered features
  4. Created a data preprocessing pipeline
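
The missing-value steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the helper names (`codes_to_nan`, `drop_sparse`), the column names, the missing-value codes, and the 30% thresholds are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def codes_to_nan(df, missing_codes):
    """Replace per-column 'missing or unknown' codes with NaN.

    missing_codes maps a column name to the values that encode
    'missing' in that column (a hypothetical structure).
    """
    out = df.copy()
    for col, codes in missing_codes.items():
        out[col] = out[col].where(~out[col].isin(codes), np.nan)
    return out


def drop_sparse(df, max_col_missing=0.3, max_row_missing=0.3):
    """Drop columns, then rows, whose share of NaN exceeds a threshold.

    The thresholds are illustrative assumptions.
    """
    col_share = df.isna().mean(axis=0)
    kept = df.loc[:, col_share <= max_col_missing]
    row_share = kept.isna().mean(axis=1)
    return kept.loc[row_share <= max_row_missing]


# Toy data with made-up columns and codes
demo = pd.DataFrame({
    "age_group": [1, 2, -1, 3],
    "income": [0, 4, 5, 2],
    "mostly_unknown": [9, 9, 9, 1],
})
codes = {"age_group": [-1], "income": [0], "mostly_unknown": [9]}
cleaned = drop_sparse(codes_to_nan(demo, codes))
print(cleaned)
```

Columns are filtered before rows so that a handful of mostly empty columns does not inflate the per-row missing share.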

Step 2: Reduce feature space dimensionality, find clusters, and identify customer segments

In the "Identify_Customer_Segments_Clustering.ipynb" Jupyter notebook, I completed the following tasks:

  1. Imputed missing values using mean imputation (multiple imputation coming soon)
  2. Standardized features
  3. Reduced dimensionality of feature space
    • Transformed feature space using principal components analysis (PCA)
    • Selected "best" subset of transformed features
    • Interpreted principal components using principal directions in original feature space
  4. Clustered customer data using KMeans
  5. Assigned general population to clusters and explored results
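
One way to chain these five steps with scikit-learn is sketched below. It is an illustration under stated assumptions rather than the notebook's code: the data is synthetic (the original data is proprietary), the number of components and clusters is arbitrary, and fitting every step on the customer data alone is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins for the customer and general population tables
customers = rng.normal(size=(200, 8))
population = rng.normal(size=(500, 8))
customers[rng.random(customers.shape) < 0.05] = np.nan
population[rng.random(population.shape) < 0.05] = np.nan

n_clusters = 4  # arbitrary choice for the example
model = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),   # mean imputation
    ("scale", StandardScaler()),                  # standardize features
    ("pca", PCA(n_components=3)),                 # reduce dimensionality
    ("kmeans", KMeans(n_clusters=n_clusters, n_init=10, random_state=0)),
])

# Cluster the customers, then assign the general population to those clusters
customer_labels = model.fit_predict(customers)
population_labels = model.predict(population)

# Share of each group that falls into each cluster
customer_share = np.bincount(customer_labels, minlength=n_clusters) / len(customer_labels)
population_share = np.bincount(population_labels, minlength=n_clusters) / len(population_labels)

# Each row of components_ is a principal direction in the original feature space
directions = model.named_steps["pca"].components_
print(customer_share, population_share, directions.shape)
```

Comparing `customer_share` with `population_share` cluster by cluster shows which segments are over- or under-represented among customers.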

I selected principal components based on variance explained and a hypothetical measure of the curse of dimensionality. Suppose the data were uniformly distributed within a unit hypersphere centered at the origin. For a given sample size and number of input features, we can calculate the median distance from the origin to the closest data point (Hastie, Tibshirani, & Friedman, 2009). As the number of features grows, this distance approaches 1, so even the nearest point lies close to the boundary of the sample space.
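
For reference, Hastie, Tibshirani, & Friedman give this median distance as d(p, N) = (1 - (1/2)^(1/N))^(1/p) for N points in p dimensions. The sketch below evaluates it for a range of feature counts. The function name and the sample size are assumptions made for the example, and how the notebook weighs this value against variance explained is not shown here.

```python
def median_nearest_distance(n_samples, n_features):
    """Median distance from the origin to the closest of n_samples points
    drawn uniformly from the unit ball in n_features dimensions."""
    return (1.0 - 0.5 ** (1.0 / n_samples)) ** (1.0 / n_features)


n_samples = 500  # illustrative sample size
for n_features in (1, 2, 5, 10, 20, 50):
    d = median_nearest_distance(n_samples, n_features)
    print(f"p={n_features:>2}  median nearest distance = {d:.3f}")
```

With 500 samples and 10 features the value is about 0.52, which means the nearest point is typically more than halfway to the boundary.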

Cluster model performance was evaluated using three metrics: mean squared error, silhouette coefficient, and Calinski-Harabasz score.
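
A rough sketch of computing these three metrics with scikit-learn follows. The data is synthetic, and reading "mean squared error" as the KMeans inertia divided by the number of samples is an assumption; the notebook may define it differently.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score, silhouette_score

# Synthetic data standing in for the PCA-transformed features
X, _ = make_blobs(n_samples=300, centers=3, n_features=4, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Mean squared distance from each point to its assigned centroid (assumed definition)
mse = km.inertia_ / X.shape[0]
# Silhouette coefficient: ranges from -1 to 1, higher is better
sil = silhouette_score(X, km.labels_)
# Calinski-Harabasz score: ratio of between- to within-cluster dispersion, higher is better
ch = calinski_harabasz_score(X, km.labels_)

print(f"MSE={mse:.3f}  silhouette={sil:.3f}  Calinski-Harabasz={ch:.1f}")
```

Mean squared error always falls as clusters are added, so it is usually read alongside the other two scores when comparing cluster counts.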

Required libraries

This project uses NumPy, pandas, Matplotlib (pyplot), seaborn, and scikit-learn.

Data Source

Data was provided by AZ Direct and Arvato Finance Solution, subsidiaries of Bertelsmann. The data pertains to mail-order sales in Germany. Because the data is proprietary, I cannot publish the original data. Please review Data_Dictionary.md for more information on the data used in this project.

Acknowledgements

Guidance was provided by Udacity.
