Giter Site home page Giter Site logo

minions-project-5's Introduction

#Predicting Red Hat Business Value

Team name - minions

Team members: Chirag Jain, Sharmin Pathan

Approach: To classify customer potential using a suitable prediction model.

Technologies Used:

  • Python 2.7
  • RFE
  • Logistic Regression
  • Random Forest Classifier

Introduction:

Like most companies, Red Hat is able to gather a great deal of information over time about the behavior of individuals who interact with them. They’re in search of better methods of using this behavioral data to predict which individuals they should approach—and even when and how to approach them. With an improved prediction model in place, Red Hat will be able to more efficiently prioritize resources to generate more business and better serve their customers. (This competition was hosted on Kaggle)

Problem Statement:

To create a classification algorithm that accurately identifies which customers have the most potential business value for Red Hat based on their characteristics and activities.

Datatset:

The dataset was taken from https://www.kaggle.com/c/predicting-red-hat-business-value/data. It includes people.csv, act_train.csv and act_test.csv. The people file contains all of the unique people (and the corresponding characteristics) that have performed activities over time. Each row in the people file represents a unique person. Each person has a unique people_id.

The activity files contain all of the unique activities (and the corresponding activity characteristics) that each person has performed over time. Each row in the activity file represents a unique activity performed by a person on a certain date. Each activity has a unique activity_id.

Preprocessing:

  • Convert the categorical values into numerical ones
  • Convert boolean attributes to numerical ones
  • Break date column into three separate columns, namely date, month, and year
  • Fill '-1' for the missing values
  • Merge people and activity files
  • Separate the data and labels

Flow:

  • Load the datasets into dataframes
  • Perform the preprocessing
  • Drop date, people_id, and act_id columns
  • Perform Logistic Regression for RFE. RFE uses the model accuracy to identify which attributes contribute the most to
    predicting the target attribute.
  • Use a 4-fold cross validation to have 3-folds as the training set and 1-fold for the validation set
  • Apply Random Forest classification
  • Predict outcomes for activities in the testing data set
  • Build a submission file with activities and their corresponding predicted outcomes

Execution:

Ensure the system is up with

  • RFE
  • Matplotlib

The program takes three command line arguments:

  • path to people.csv
  • path to act_train.csv
  • path to act_test.csv

For example: redHat.py people.csv act_train.csv act_test.csv

Performance:

We made a submission on Kaggle.

screen shot 2016-12-13 at 7 14 34 pm

Tuning the accuracy:

  • Tuning parameters of Random Forest classifier, namely maxDepth, no. of trees, maxBin.

Stuff we tried:

  • Plotting graphs for various attributes against the outcome to understand their distribution in the dataset
  • PCA Visualization

Challenges:

The dataset was pretty difficult to understand. Enough details were not provided on the competition page about the datasets.

####References: https://www.kaggle.com/c/predicting-red-hat-business-value

minions-project-5's People

Contributors

chirag0734 avatar sharminpathan avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.