chicago / food-inspections-evaluation

This repository contains the code to generate predictions of critical violations at food establishments in Chicago. It also contains the results of an evaluation of the effectiveness of those predictions.

Home Page: http://chicago.github.io/food-inspections-evaluation/

License: Other

Topics: chicago, data-science, open-data, open-science, food-poisoning, cdph, public-health

food-inspections-evaluation's Introduction

Food Inspections Evaluation

This is our model for predicting which food establishments are most at risk for the types of violations most likely to spread food-borne illness. Chicago Department of Public Health staff use these predictions to prioritize inspections. During a two-month pilot period, we found that using these predictions meant that inspectors found critical violations much faster.

You can help improve the health of our city by improving this model. This repository contains a training and test set, along with the data used in the current model.

Feel free to clone, fork, send pull requests, and file bugs. Please note that we will need you to agree to our Contributor License Agreement (CLA) before we can accept any pull requests.

Original Analysis and Reports

In an effort to reduce the public’s exposure to foodborne illness, the City of Chicago partnered with Allstate’s Quantitative Research & Analytics department to develop a predictive model to help prioritize the city's food inspection staff. This GitHub project is a complete working evaluation of the model, including the data that was used in the model, the code used to produce the statistical results, the evaluation of the validity of the results, and documentation of our methodology.

The model evaluation calculates individualized risk scores for more than ten thousand Chicagoland food establishments using publicly available data, most of which is updated nightly on Chicago’s data portal. The sole exception is information about the inspectors.

The evaluation compares two months of Chicago’s Department of Public Health inspections to an alternative, data-driven approach based on the model. The two-month evaluation period is a completely out-of-sample evaluation based on a model created using test and training data sets from prior time periods.

The reports may be reproduced by compiling the knitr documents in ./REPORTS.

REQUIREMENTS

All of the code in this project uses the open-source statistical application R. We advise using R version >= 3.1 for best results.

Ubuntu users may need to install libssl-dev, libcurl4-gnutls-dev, and libxml2-dev, which can be done with the following command: sudo apt-get install libssl-dev libcurl4-gnutls-dev libxml2-dev

The code makes extensive use of the data.table package. If you are not familiar with it, you may want to consult the data.table FAQ available on CRAN: http://cran.r-project.org/web/packages/data.table/vignettes/datatable-faq.pdf
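If the bracket syntax is unfamiliar, here is a small illustration of the idioms used throughout ./CODE/ (toy data, not from this project):

library(data.table)

dt <- data.table(License = c(1L, 1L, 2L), critical = c(0L, 1L, 1L))
dt[ , .N, by = License]                          ## count rows per license
dt[critical == 1, .(fails = .N), by = License]   ## filtered aggregate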

FILE LAYOUT

The following directory structure is used:

DIRECTORY          DESCRIPTION
.                  Project files such as README and LICENSE
./CODE/            Sequential scripts used to develop the model
./CODE/functions/  General function definitions, which could be used in any script
./DATA/            Data files created by scripts in ./CODE/, or static data files
./REPORTS/         Reports and other output

We have included all of the steps used to develop the model, evaluate the results, and document the results in the above directory structure.

The scripts located in the ./CODE/ folder are organized sequentially: the numeric prefix indicates the order in which each script was run, and should be run, to reproduce our results.

Although we include all the necessary steps to download and transform the data used in the model, we also have stored a snapshot of the data in the repository. So, to run the model as it stands, it is only necessary to download the repository, install the dependencies, and step through the code in CODE/30_glmnet_model.R. If you do not already have them, the dependencies can be installed using the startup script CODE/00_Startup.R.
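In other words, a minimal reproduction looks like the following, assuming the R working directory is set to the repository root:

source("CODE/00_Startup.R")       ## install and load dependencies
source("CODE/30_glmnet_model.R")  ## fit the model on the stored data snapshot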

DATA

Data used to develop the model is stored in the ./DATA directory. Most of it comes from Chicago’s Open Data Portal. The following datasets were used in building the analysis-ready dataset:

  • Business Licenses
  • Food Inspections
  • Crime
  • Garbage Cart Complaints
  • Sanitation Complaints
  • Weather
  • Sanitarian Information

The data sources are joined to create a tabular dataset that paints a statistical picture of a ‘business license’, the primary modelling unit (unit of observation) in this project.

The data sources are joined (in a SQL-like manner) on appropriate composite keys. These keys include Inspection ID, Business License, and geography expressed as a latitude/longitude combination, among others.
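A minimal sketch of one such join, with illustrative column names rather than the project's exact schema:

library(data.table)

inspections <- data.table(Inspection_ID = 1:3,
                          License       = c(10L, 10L, 20L))
licenses    <- data.table(License       = c(10L, 20L),
                          Facility_Type = c("Restaurant", "Grocery Store"))

## Join the two sources on the shared License key
merge(inspections, licenses, by = "License")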

Acknowledgements

This research was conducted by the City of Chicago with support from the Civic Consulting Alliance and Allstate Insurance. The City would especially like to thank Stephen Collins, Gavin Smart, Ben Albright, and David Crippin for their efforts in developing the predictive model. We also appreciate the help of Kelsey Burr, Christian Hines, and Kiran Pookote in coordinating this research project. We owe special thanks to our volunteers from Allstate, who put in a tremendous effort to develop the predictive model, and to Allstate for allowing their team to volunteer for projects that change their city. This project was partially funded by an award from the Bloomberg Philanthropies' Mayors Challenge.

food-inspections-evaluation's People

Contributors

cash, fgregg, geneorama, mountmckinney, skishchampi, socratesk, thingdiputra, tomschenkjr


food-inspections-evaluation's Issues

Create CONTRIBUTING document

Create a CONTRIBUTING.md document that describes the minimum contributing guidelines for contributors. Some thoughts on potential content:

  • Outline that GitHub issues are the best way to submit an issue.
  • Outline the parameters we seek if someone is attempting to improve the model, such as describing how to show their model is effective.

Any other thoughts?

[Bug] Fix factor handling

We need to track the factor levels and make sure that they are consistent across different data partitions, and with out of sample data. Right now this isn't a bug, but it will be in production.

I actually ran into a problem last week that I thought was data.table, but it was actually just general "factor handling": Rdatatable/data.table#967

Here's a general example:
I run into these problems a lot, especially when predicting on out of sample data. For example, if a model had "AZ" "AR" "CA" "CO" in a column called "US_STATE", but the prediction data only has "CA" in "US_STATE", then CA will be coded to a factor of 1 if you don't also set the levels. In the model the factor level of 1 gets treated as AZ!
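A minimal sketch of the pitfall and its fix:

train_states <- factor(c("AZ", "AR", "CA", "CO"))
new_states   <- factor("CA")

as.integer(train_states)  ## CA codes to 3 (levels sort to AR, AZ, CA, CO)
as.integer(new_states)    ## CA codes to 1 -- a different variable entirely

## The fix: reuse the training levels when building out-of-sample factors
new_states_fixed <- factor("CA", levels = levels(train_states))
as.integer(new_states_fixed)  ## 3, consistent with training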

Incorporate data.table and new project structure

Refactor code to

  • Incorporate data.table throughout project for more efficient data management and handling
  • Incorporate convention of putting functions in separate files
  • Incorporate convention of creating a fresh initialization for each script
  • Incorporate convention of ordering scripts according to process
  • Move process scripts into functions
  • Match process output of new process to "DATA/recreated_training_data_20141103v02.Rdata"

Clean up facility type

Use this code to convert facility type to something that is more like the original (and is more manageable than the original)

## Tabulate the raw facility types by frequency
dat[ , .N, Facility_Type][order(-N)]
dat[ , .N, Facility_Type][order(-N), Facility_Type]

## Map the messy raw values onto a small set of clean categories
dat[grep("restaurant", Facility_Type, ignore.case = TRUE), Facility_Type_Clean := "Restaurant"]
dat[grep("grocery", Facility_Type, ignore.case = TRUE), Facility_Type_Clean := "Grocery Store"]
dat[is.na(Facility_Type_Clean), Facility_Type_Clean := "Other"]

## Check the cleaned distribution
dat[ , .N, Facility_Type_Clean][order(-N)]

Rewrite functions to generate data

Right now, several R files source several other R files that download and organize data. The dates are hardcoded, which makes it difficult to generate data over different date ranges. It would be more useful to change this into a function that takes user-defined parameters for the date range. For instance:

foodInspect <- liveReadInFoodInspections(start.date="%Y-%m-%d", end.date="%Y-%m-%d")

Several functions need to be written so we can pass a date parameter to retrieve data.

  • recreate_training_data.R
  • create_out-of-sample_data.R
  • food-inspections-evaluation.R
  • addWeather.R

This supersedes #3
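A hypothetical sketch of such a function, assuming the Socrata endpoint for the Food Inspections dataset (4ijn-s7e5) and the jsonlite package; the real implementation may differ:

library(jsonlite)
library(data.table)

liveReadInFoodInspections <- function(start.date, end.date) {
    ## Build a SoQL query filtering on the inspection date range
    url <- paste0("https://data.cityofchicago.org/resource/4ijn-s7e5.json",
                  "?$where=inspection_date between '", start.date,
                  "' and '", end.date, "'",
                  "&$limit=50000")
    as.data.table(fromJSON(URLencode(url)))
}

## Usage, with concrete dates in place of the format placeholders:
## foodInspect <- liveReadInFoodInspections("2014-01-01", "2014-12-31")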

Model diagnostics

Develop the ability to test new models and see diagnostic results in a GUI (Shiny).

Convert Latitude to numeric in the import step, rather than in the merge step

These lines belong in the import step:

## (THIS SHOULD HAPPEN IN THE 10 IMPORT STEP)
foodInspect[ , Latitude := as.numeric(Latitude)]
## FIX SANITATION LATITUDE
## (THIS SHOULD HAPPEN IN THE 10 IMPORT STEP)
sanitationComplaints[ , Latitude := as.numeric(Latitude)]
## FIX GARBAGE CART LATITUDE
## (THIS SHOULD HAPPEN IN THE 10 IMPORT STEP)
garbageCarts[ , Latitude := as.numeric(Latitude)]

Update README.md

Check that the README.md reflects the most recent steps, and is complete with the current process.

Remove unnecessary terminal outputs from 12_Merge.R

Update README file

Need to update this to include the relevant files, license (MIT), and execution instructions.

Cannot execute sqldf code on server

The recreate_training_data.R file uses sqldf to perform some SQL operations (joining and filtering). Since we're using R-3.0.2, the latest version of sqldf was incompatible (it requires R >= 3.1). I included scripts to manually install it in 5b52d79.

However, it appears we have some driver issues, which may be related to the nature of the install. Specifically, we receive this error message:

Error in validObject(.Object) : 
  invalid class “SQLiteDriver” object: invalid object for slot "Id" in class "SQLiteDriver": got class "integer", should be or extend class "externalptr"
Error in !dbPreExists : invalid argument type

I'll be switching to a Mac to do the intermediate work. If we cannot get sqldf() to work, we can use the merge() function.
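A minimal sketch of that fallback, with hypothetical table and column names, showing the same join via sqldf and via base merge():

library(sqldf)

inspections <- data.frame(LICENSE_NO = c(10, 10, 20),
                          RESULTS    = c("Pass", "Fail", "Pass"))
licenses    <- data.frame(LICENSE_NO    = c(10, 20),
                          FACILITY_TYPE = c("Restaurant", "Grocery Store"))

joined_sql <- sqldf("SELECT i.*, l.FACILITY_TYPE
                     FROM inspections i
                     JOIN licenses l ON i.LICENSE_NO = l.LICENSE_NO")

## Equivalent in base R, with no SQLite driver involved:
joined_base <- merge(inspections, licenses, by = "LICENSE_NO")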

Fix penalty in glmnet model

Currently the penalty in the model is ifelse(grepl("^Inspector.Assigned", colnames(mm)), 1, 0); however, the pattern no longer matches because the Inspector.Assigned variable name was changed some time ago before entering the model matrix, so the penalty vector needs to be updated to use the new name.
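For context, a runnable sketch of how such a penalty vector enters a glmnet fit (toy data; column names illustrative):

library(glmnet)

set.seed(1)
mm <- matrix(rnorm(600), ncol = 3,
             dimnames = list(NULL, c("Inspector.Assigned_blue",
                                     "Inspector.Assigned_green",
                                     "pastSerious")))
y <- rbinom(200, 1, 0.3)

## Columns matching the pattern get penalty 1 (subject to shrinkage);
## all other columns get 0 and are never shrunk out of the model.
pen <- ifelse(grepl("^Inspector.Assigned", colnames(mm)), 1, 0)
fit <- glmnet(mm, y, family = "binomial", penalty.factor = pen)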

Banner image does not display on iOS devices

Confirmed on iPad and iPhone. Appears to be a problem with the type of effect I'm using for parallax. Some documentation suggests jQuery handles this better, so I will explore those options.

Fix sloppy weather code in 12 merge

The weather data is read in at the start of the file, but then overwritten later on with a combination of weather and weather_new.

Although there is no problem with the functionality, this is bad practice and should be fixed. The right approach would be to keep the objects separate (e.g., use weather_old and weather_new, then combine them into weather), or at least combine them immediately after reading them in.
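A sketch of the suggested pattern, assuming hypothetical file names:

library(data.table)

weather_old <- fread("DATA/weather_old.csv")   ## hypothetical file names
weather_new <- fread("DATA/weather_new.csv")
weather     <- rbindlist(list(weather_old, weather_new))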

Test dev branch before merge to master

Code should be run from beginning to end to ensure it completes with no errors. The following components should be tested:

  • Generate training data
  • Generate out-of-sample data
  • Execute GLMnet code

Proposed edits to technical document

There are several changes and updates that I would like to propose for the document, so I am creating a new issue for that work (this branch will be a descendant of issue 47).

  • Reword the summary to focus on "days sooner" rather than percent improvement.
  • Add mentions of the model and software to the introduction
  • Add subsections
  • Fix typos
  • Update formulas
  • Experiment with citation representation
  • Add a citation for the MASS package
  • Describe the KDE method for heat variables

Location of technical article?

This is annoyingly late in the game, but I am proposing a conversation on where to place the article in the repo. Right now it's placed in the root. However, there are a number of files/artifacts for this article (as opposed to REPORTS).

.
├── (other files)
├── REPORTS/
├── article
|   ├── forecasting-restaurants-with-critical-violations-in-Chicago.Rmd (knitr doc)
|   ├── forecasting-restaurants-with-critical-violations-in-Chicago.pdf (PDF)
|   ├── forecasting-restaurants-with-critical-violations-in-Chicago.doc (Word)
|   ├── forecasting-restaurants-with-critical-violations-in-Chicago.html (Webpage)
|   ├── forecasting-restaurants-with-critical-violations-in-Chicago.bib (BibTeX of references)
|   └──  forecasting-restaurants-with-critical-violations-in-Chicago-citation.bib (PROPOSED: BibTeX of citation for what we write)

As you can see, I am pointing out the inclusion of the BibTeX file needed to generate the article's bibliography. I'm also proposing a BibTeX file that shows how to cite the article we have prepared.

Thoughts?

Change liveReadInFoodInspections.R to a function

Right now, other scripts source this file directly, and it conducts a number of operations when sourced. It would be more useful to change it into a function that takes user-defined parameters for the date range. For instance:

foodInspect <- liveReadInFoodInspections(start.date="%Y-%m-%d", end.date="%Y-%m-%d")

Streamline and package "data update" step

Right now this process assumes a manually created, single-point-in-time download.

We need a way to refresh / add new data. This should be accomplished by adding function(s) that encompass the entire data step.

It makes sense to tackle some other issues in this step:

  • Fix Weather import process (issue 32)
  • Convert latitude to numeric during import (issue 31)
  • Add "Cleaned" facility type to data (issue 30)
  • Merge logic (issue 22); this issue is almost a repeat of the current issue (38)
  • Remove old data files and put into top level DATA folder (new issue)

Correct typos in gh-pages

Fix the following issues:

  • Remove references to facility type and risk variables
  • Update bar chart with revised percentages

Mask food inspector identity

Find a way to represent the food inspector information without being individually identifiable, and without dramatically impacting model quality.
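One possible approach, sketched with toy data and hypothetical column names: replace each sanitarian's ID with an arbitrary anonymous label and drop the original identifier.

library(data.table)

set.seed(42)
inspections <- data.table(Inspection_ID = 1:4,
                          Inspector_ID  = c("smith", "jones", "smith", "lee"))

## Build a randomized lookup table, join it on, then drop the real ID
ids <- unique(inspections$Inspector_ID)
key <- data.table(Inspector_ID   = ids,
                  Inspector_Anon = paste0("inspector_", sample(seq_along(ids))))
inspections <- merge(inspections, key, by = "Inspector_ID")[ , Inspector_ID := NULL][]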

gh-pages nav bar taller than defined

Unsure why the navbar is appearing taller than defined in the CSS. See the screenshot below, showing the difference between the height defined in the CSS and the height rendered in the browser.
[screenshot, 2015-02-08]

Reproducible documentation that summarizes evaluation findings

The findings should be summarized in a reproducible document (e.g., knitr).

Let's use checkboxes to flag items that should be in the final document. Here are some of my items (revised on 2/3):

  • Reference model source in knitr document
  • Update percentage difference found in the first period between BAU and DDM
  • table of variables [2]
  • additional narrative on how we masked food inspectors from public model [3.1]
  • output of regression results (significant variables) [3.1]
  • "Graph of "days sooner that a critical violation would have been discovered" [4.1]
  • "Critical violations found on a daily basis as a percent of total inspections performed" [4.1]
  • "Bar plot of finding violations in in the first period by group (BAU v model)" [4.1]
  • "Cumulative Critical Violations Found BAU Versus Model" [4.1]
  • "Cumulative difference between violations found (BAU and model) and best-case-scenario" [4.1]

Supersedes issue #5

Fonts on different computers on gh-pages

It doesn't appear the League Spartan font is rendering correctly on all computers. Need to test further and fix this.

(I think it's just a problem on how the font is being referenced in the CSS).

[Request] Break out logic in 12_Merge step

The calculated variables should be put into separate functions or scripts, and the merge should just merge in the output from a function or the precalculated information produced by the script (see the sketch after the list below).

Basis for data: Food inspections

Likely candidates for new scripts:

  • Food inspection history
  • Associated business characteristics
  • Weather data
  • Heat data variables
  • Inspector information
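A sketch of the proposed structure, with hypothetical function names and toy data: each feature family is computed by its own function keyed on Inspection_ID, and the merge step reduces to a chain of joins.

library(data.table)

foodInspect <- data.table(Inspection_ID = 1:3, License = c(10L, 10L, 20L))

## Hypothetical feature builders, each returning one row per Inspection_ID
calc_history <- function(fi) {
    fi[ , .(Inspection_ID, prior_visits = seq_len(.N) - 1L), by = License
       ][ , License := NULL][]
}
calc_weather <- function(fi) {
    data.table(Inspection_ID = fi$Inspection_ID, heat_3day = c(71, 74, 80))
}

## The merge step just chains keyed joins over the builders' outputs
model_data <- Reduce(function(x, y) merge(x, y, by = "Inspection_ID"),
                     list(foodInspect, calc_history(foodInspect),
                          calc_weather(foodInspect)))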

Create graphs summarizing findings

A few graphs that should be created:

  • % of critical violations found vs % of inspections conducted (for both the actual and simulated results)
  • Number of critical violations by month for actual and simulated results
