Mario Albuquerque
May 31st, 2018
The subject of this individual project was taken from the Kaggle competition "Avito Demand Prediction Challenge".
The problem is one of determining the demand for an online advertisement given heterogeneous data types (categorical, numerical, text, and image).
The solution takes two approaches: a supervised classification model that flags likely and unlikely deals, and a supervised regression model that generates a deal probability forecast.
The project uses the data provided through the Kaggle competition "Avito Demand Prediction Challenge". File locations are given relative to the root folder that contains the Jupyter Notebook files. There are two data files to be extracted:
- train.csv.zip, which contains a file named train.csv with a total of 1,503,424 ads (around 931,000 KB). This is the main data source with the ads. This file should be in the folder "./Data/".
- train_jpg.zip, which contains the images corresponding to the ads in the train.csv dataset. The zipped folder holds a total of 1,390,836 images and is around 52,000,000 KB in size. Note that not all ads in the train.csv dataset have an image. This file should be unzipped in the folder "./Data/Images/".
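The extraction step described above can be sketched with the standard library alone. This is a minimal sketch, not the project's own code: the helper name `extract_archive` is hypothetical, and it assumes that train.csv is meant to end up directly in "./Data/" and that both archives have already been downloaded from Kaggle.

```python
import zipfile
from pathlib import Path


def extract_archive(archive_path, target_dir):
    """Unzip archive_path into target_dir and return the number of files extracted.

    Hypothetical helper for illustration; it is not part of the notebooks.
    """
    target = Path(target_dir)
    # Create the target folder (and any missing parents) if it is not there yet.
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(str(archive_path)) as archive:
        # Count regular files only; directory entries end with a slash.
        members = [name for name in archive.namelist() if not name.endswith("/")]
        archive.extractall(str(target))
    return len(members)
```

Under these assumptions, calling `extract_archive("train.csv.zip", "./Data/")` and `extract_archive("train_jpg.zip", "./Data/Images/")` from the root folder would produce the layout listed above. If the image archive contains a top-level folder, the images will land in a subfolder of "./Data/Images/" and may need to be moved up one level.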
This project was done with Python 3.5.3 and needs the following packages (outside of the Python Standard Library):
- pandas 0.22.0
- numpy 1.11.3
- textblob 0.15.1
- pillow 5.0.0
- matplotlib 1.5.1
- nltk 3.2.4
- keras 2.1.4
- opencv 3.2.0
- ipython 6.1.0
- scikit-learn 0.19.1
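To compare an existing environment against the versions listed above, a small check along the following lines may help. It is only a sketch: the function name `check_versions` and the `EXPECTED` table are hypothetical, the import names (for example `PIL` for pillow, `cv2` for opencv, `sklearn` for scikit-learn) are the usual ones for these packages, and it assumes each module exposes a `__version__` attribute, reporting "unknown" when it does not.

```python
import importlib

# Package name -> (import name, version listed in this document).
EXPECTED = {
    "pandas": ("pandas", "0.22.0"),
    "numpy": ("numpy", "1.11.3"),
    "textblob": ("textblob", "0.15.1"),
    "pillow": ("PIL", "5.0.0"),
    "matplotlib": ("matplotlib", "1.5.1"),
    "nltk": ("nltk", "3.2.4"),
    "keras": ("keras", "2.1.4"),
    "opencv": ("cv2", "3.2.0"),
    "ipython": ("IPython", "6.1.0"),
    "scikit-learn": ("sklearn", "0.19.1"),
}


def check_versions(expected):
    """Return {package: status}; status is 'ok', 'missing', or a mismatch note."""
    report = {}
    for package, (module_name, wanted) in expected.items():
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            report[package] = "missing"
            continue
        # Fall back to "unknown" when the module has no __version__ attribute.
        found = getattr(module, "__version__", "unknown")
        if str(found) == wanted:
            report[package] = "ok"
        else:
            report[package] = "found {}, expected {}".format(found, wanted)
    return report
```

Calling `check_versions(EXPECTED)` returns one status per package. A mismatch does not necessarily mean the notebooks will fail; the versions above are simply the ones the project was developed with.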
The project is implemented in three Jupyter Notebooks:
- EDA.ipynb: Exploratory data analysis.
- Feature Engineering.ipynb: Feature engineering.
- Model Development.ipynb: Model development and evaluation.