Final Project for Big Data Science, Spring 2019
Cole Smith
Undergraduate
This project was written in Python 3.7. It is recommended
to set up the virtual environment with that version. If your system
defaults to Python 2, an interpreter can be specified with the --python
flag to virtualenv (for example, virtualenv --python=python3.7 venv).
To set up the environment run:
virtualenv venv
source venv/bin/activate
pip install -r requirements.txt
The clustering output can be viewed in doc/. It can also be
generated by commenting out the code labelled as such in main.py.
The predictions can be run directly by executing: python main.py
For clarity, the prediction output for the Regression is the total number of restaurant closures (hard and soft, see below) for a given month, given a number of factors. Each row is a zip code in a given month.
The prediction output for the Classification is of soft closures (see below). This is done using the Restaurant Inspection Data Set. Each row is a restaurant as of the present day.
Since different datasets cannot reliably be joined, the closure information is broken out into Hard and Soft closures.
Hard Closures are those in which a restaurant did not renew its DCA license and thus cannot legally operate in New York City. These restaurants are assumed not to re-open, since this closure was presumably voluntary.
Soft Closures are those in which the health inspection results warrant a complete closure. These offer a richer set of supporting features since they originate from the Restaurant Inspection Data Set. However, there are generally far fewer soft closures than hard closures.
These closures are assumed to be involuntary, and restaurants may re-open upon a second inspection.
Since there is no given unique, universal identifier for a restaurant in these data sets, the only information that can be used to merge tables is the zip code and the date (Month and Year).
However, since it is assumed that Soft and Hard Closures are drawn from the same distribution (All restaurants must be inspected and must hold a DCA license), the master data set also includes information from the Restaurant Inspection Data Set aggregated to a monthly time-scale.
The total closures are therefore the sum of soft and hard closures for a given month and zip code.
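The aggregation described above can be sketched as follows. This is a minimal illustration and not the code in main.py: the function names, the sample zip codes, and the sample dates are all hypothetical, and it uses only the standard library, not whatever the project itself relies on. Each closure event is reduced to a (zip code, year, month) key, counted per key, and the hard and soft counts are then added together.

```python
from collections import Counter
from datetime import date


def month_key(zip_code, when):
    """Return the (zip, year, month) key shared by both closure tables."""
    return (zip_code, when.year, when.month)


def count_by_zip_month(events):
    """Count closure events per (zip, year, month) key."""
    return Counter(month_key(zip_code, when) for zip_code, when in events)


# Hypothetical sample events: (zip code, closure date). Not real data.
hard_events = [
    ("10003", date(2018, 5, 2)),
    ("10003", date(2018, 5, 20)),
    ("11201", date(2018, 6, 1)),
]
soft_events = [
    ("10003", date(2018, 5, 11)),
    ("11201", date(2018, 7, 9)),
]

hard = count_by_zip_month(hard_events)
soft = count_by_zip_month(soft_events)

# Adding two Counters sums the counts key by key, so a key that appears
# in only one table still carries its count into the total.
total = hard + soft

for key in sorted(total):
    print(key, total[key])
```

Because the key holds only the zip code, year, and month, two restaurants in the same zip code that close in the same month are indistinguishable in the total, which matches the limitation noted above.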