Kaggle Instacart Classification

I built models to classify whether or not items in a user's order history will appear in their most recent order, essentially recreating the Kaggle Instacart Market Basket Analysis competition. Because the full dataset was too large to work with on my older MacBook, I loaded the data into a SQL database on an AWS EC2 instance. This setup allowed me to easily query subsets of the data for all of my preliminary development. I then cleaned up my work and wrote it into a script called 'build_models.py' that can be run from a notebook or the command line. Once I was ready to scale up to the full dataset, I ran build_models.py on a 2XL EC2 instance and brought the resulting models back into my 'kaggle_instacart' notebook for test set evaluation.
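
A minimal sketch of pulling a development subset out of such a database (the connection string, table names, and sampling query are illustrative assumptions, not the project's actual schema):

```python
# Pull all prior orders for a small random sample of users so the frame
# fits comfortably in memory on an older laptop. Host, credentials, and
# table/column names are placeholders.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@my-ec2-host:5432/instacart")

query = """
    SELECT o.user_id, o.order_id, o.order_number, op.product_id
    FROM orders o
    JOIN order_products_prior op ON op.order_id = o.order_id
    WHERE o.user_id IN (
        SELECT user_id FROM orders GROUP BY user_id ORDER BY random() LIMIT 1000
    )
"""
subset = pd.read_sql(query, engine)
```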

Feature Engineering

I spent the majority of my time on this project engineering features from the basic dataset. After creating several features, I tested different combinations of them on a small subset of the data in order to eliminate any that seemed to have no effect on model output. After paring down the features, I trained and tested my final models on the following predictors (a sketch of how a couple of them can be computed follows the list):

  • percent_in_user_orders: Percent of a user's orders in which an item appears
  • percent_in_all_orders: Percent of all orders in which an item appears
  • in_last_cart: 1 if an item appears in a user's most recent prior order, 0 if not
  • in_last_five: Number of orders in a user's five most recent prior orders in which an item appears
  • total_user_orders: Total number of orders placed by a user
  • mean_orders_between: Average number of orders between appearances of an item in a user's order history
  • mean_days_between: Average number of days between appearances of an item in a user's order history
  • orders_since_newest: Number of orders between the last user order containing an item and the most recent order
  • days_since_newest: Number of days between the last user order containing an item and the most recent order
  • product_reorder_proba: Probability that any user reorders an item
  • user_reorder_proba: Probability that a user reorders any item
  • mean_cart_size: Average user cart (aka order) size
  • mean_cart_percentile: Average percentile of the position at which a user adds an item to their cart
  • mean_hour_of_week: Average hour of the week that a user orders an item (168 hours in a week)
  • newest_cart_size: Number of items in the most recent cart
  • newest_hour_of_week: Hour of the week that the most recent order was placed
  • cart_size_difference: Absolute value of the difference between the average size of the orders containing an item and the size of the most recent order
  • hour_of_week_difference: Absolute value of the difference between the average hour of the week in which a user purchases an item and the hour of the week of the most recent order
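
As a rough illustration, two of these predictors can be built from the prior-orders table with standard pandas groupbys; this is a sketch under assumed column names (user_id, product_id, order_number), not the exact code in build_models.py:

```python
# Assumes a dataframe `prior` of all prior order lines with columns
# user_id, product_id, and order_number (1 = a user's oldest order).
import pandas as pd

# Total prior orders per user
user_orders = (
    prior.groupby("user_id")["order_number"].max()
    .rename("total_user_orders").reset_index()
)

# Number of a user's orders that contain each item
item_orders = (
    prior.groupby(["user_id", "product_id"])["order_number"].nunique()
    .rename("orders_with_item").reset_index()
)

features = item_orders.merge(user_orders, on="user_id")

# percent_in_user_orders: share of a user's orders in which the item appears
features["percent_in_user_orders"] = (
    features["orders_with_item"] / features["total_user_orders"]
)

# in_last_five: appearances of the item in the user's five most recent prior orders
cutoff = prior.groupby("user_id")["order_number"].transform("max") - 5
recent = prior[prior["order_number"] > cutoff]
in_last_five = (
    recent.groupby(["user_id", "product_id"])["order_number"].nunique()
    .rename("in_last_five").reset_index()
)
features = (
    features.merge(in_last_five, on=["user_id", "product_id"], how="left")
    .fillna({"in_last_five": 0})
)
```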

Models

In my preliminary tests using subsets of the Instacart data, I trained a number of different models: logistic regression, gradient boosting decision trees, random forest, and KNN. After several rounds of testing, I took the two that performed best, logistic regression and gradient boosting trees, and trained them on the full dataset, minus a holdout test set. I used F1 score as my evaluation metric because I wanted the models to balance precision and recall in predicting which previously ordered items would appear in the newest orders. To account for the large class imbalance caused by the majority of previously ordered items not appearing in the most recent orders, I also computed F1 scores at an adjusted probability threshold. The scores below treat each dataframe row, which represents an item ordered by a specific user, as a separate, equally weighted entity. Both models performed similarly, with the gradient boosting trees classifier achieving slightly higher scores:

Model                      Raw F1 Score    Adjusted F1 Score
Logistic Regression        0.313           0.447
Gradient Boosting Trees    0.338           0.461
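
One common way to produce an adjusted-threshold F1 score like the one above is to sweep candidate thresholds on held-out predictions and keep the best one. A minimal scikit-learn sketch, assuming y_true and y_proba hold validation labels and predicted probabilities (the exact procedure in build_models.py may differ):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# F1 at every candidate threshold along the precision-recall curve
precision, recall, thresholds = precision_recall_curve(y_true, y_proba)
f1 = 2 * precision * recall / (precision + recall + 1e-12)

best = np.argmax(f1[:-1])              # the final PR point has no threshold
best_threshold = thresholds[best]
adjusted_preds = (y_proba >= best_threshold).astype(int)
```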

I also calculated mean per-user F1 scores, which more closely match the metric of the original Kaggle contest. If either model were incorporated into a recommendation engine, the user-based metric would better represent its performance. By this measure, model performance is virtually identical:

Model                      Per-User F1 Score
Logistic Regression        0.367
Gradient Boosting Trees    0.368
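
A sketch of this per-user metric, assuming a dataframe results with one row per user-item pair and columns user_id, y_true, and y_pred (names are illustrative):

```python
from sklearn.metrics import f1_score

# F1 is computed separately for each user's set of candidate items,
# then averaged so every user counts equally.
per_user_f1 = results.groupby("user_id").apply(
    lambda g: f1_score(g["y_true"], g["y_pred"], zero_division=0)
)
mean_per_user_f1 = per_user_f1.mean()
```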

Results

The charts below show the most influential predictors and their respective weights for each model. The logistic regression model relies heavily upon information about the size of the most recent cart, while the gradient boosting decision trees model gives far more weight to the contents of a user's previous orders. If information about the most recent cart were not available, the gradient boosting model would most likely outperform the logistic regression model.

[Chart: logistic regression model coefficients] [Chart: gradient boosting decision trees model coefficients]
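
Rankings like those in the charts can be pulled from fitted scikit-learn models roughly as follows (variable names log_reg, gbt, and feature_names are assumptions; for the tree model, feature_importances_ plays the role of coefficients):

```python
import pandas as pd

# Rank predictors by absolute coefficient (logistic regression) and by
# impurity-based importance (gradient boosting trees).
log_reg_weights = pd.Series(log_reg.coef_[0], index=feature_names).sort_values(
    key=abs, ascending=False
)
gbt_weights = pd.Series(gbt.feature_importances_, index=feature_names).sort_values(
    ascending=False
)

print(log_reg_weights.head(10))
print(gbt_weights.head(10))
```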

Conclusions

This project was all about feature creation: the more features I engineered, the better my models performed. At Metis I had a tight deadline to get everything done and as a result did not incorporate all of the predictors I wanted to. I plan to eventually circle back and add more, including implementing some ideas from the Kaggle contest winners.
