chicago / food-inspections-evaluation

This repository contains the code to generate predictions of critical violations at food establishments in Chicago. It also contains the results of an evaluation of the effectiveness of those predictions.

Home Page: http://chicago.github.io/food-inspections-evaluation/

License: Other

Topics: chicago, data-science, open-data, open-science, food-poisoning, cdph, public-health

food-inspections-evaluation's Introduction

Food Inspections Evaluation

This is our model for predicting which food establishments are most at risk for the types of violations most likely to spread food-borne illness. Chicago Department of Public Health staff use these predictions to prioritize inspections. During a two-month pilot period, we found that using these predictions meant that inspectors found critical violations much faster.

You can help improve the health of our city by improving this model. This repository contains a training and test set, along with the data used in the current model.

Feel free to clone, fork, send pull requests, and file bugs. Please note that we will need you to agree to our Contributor License Agreement (CLA) before we can accept any pull requests.

Original Analysis and Reports

In an effort to reduce the public’s exposure to foodborne illness, the City of Chicago partnered with Allstate’s Quantitative Research & Analytics department to develop a predictive model to help prioritize the city's food inspection staff. This GitHub project is a complete working evaluation of the model, including the data that was used in the model, the code used to produce the statistical results, the evaluation of the validity of the results, and documentation of our methodology.

The model evaluation calculates individualized risk scores for more than ten thousand Chicagoland food establishments using publicly available data, most of which is updated nightly on Chicago’s data portal. The sole exception is information about the inspectors.

The evaluation compares two months of Chicago’s Department of Public Health inspections to an alternative, data-driven approach based on the model. The two-month evaluation period is a completely out-of-sample evaluation based on a model created using test and training data sets from prior time periods.

The reports may be reproduced by compiling the knitr documents in ./REPORTS.

REQUIREMENTS

All of the code in this project uses the open-source statistical application R. We advise using R version >= 3.1 for best results.

Ubuntu users may need to install libssl-dev, libcurl4-gnutls-dev, and libxml2-dev, which can be done with the following command: sudo apt-get install libssl-dev libcurl4-gnutls-dev libxml2-dev

The code makes extensive use of the data.table package. If you are not familiar with it, you may want to consult the data.table FAQ available on CRAN: http://cran.r-project.org/web/packages/data.table/vignettes/datatable-faq.pdf
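If the bracket syntax is unfamiliar, here is a small illustration of the idioms used throughout ./CODE/ (toy data, not from this project):

library(data.table)

dt <- data.table(License = c(1L, 1L, 2L), critical = c(0L, 1L, 1L))
dt[ , .N, by = License]                          ## count rows per license
dt[critical == 1, .(fails = .N), by = License]   ## filtered aggregate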

FILE LAYOUT

The following directory structure is used:

DIRECTORY          DESCRIPTION
.                  Project files such as README and LICENSE
./CODE/            Sequential scripts used to develop the model
./CODE/functions/  General function definitions, which could be used in any script
./DATA/            Data files created by scripts in ./CODE/, or static data files
./REPORTS/         Reports and other output

We have included all of the steps used to develop the model, evaluate the results, and document the results in the above directory structure.

The scripts located in the ./CODE/ folder are organized sequentially: the numeric prefix indicates the order in which each script was run, and should be run, to reproduce our results.

Although we include all the necessary steps to download and transform the data used in the model, we also have stored a snapshot of the data in the repository. So, to run the model as it stands, it is only necessary to download the repository, install the dependencies, and step through the code in CODE/30_glmnet_model.R. If you do not already have them, the dependencies can be installed using the startup script CODE/00_Startup.R.
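In other words, a minimal reproduction looks like the following, assuming the R working directory is set to the repository root:

source("CODE/00_Startup.R")       ## install and load dependencies
source("CODE/30_glmnet_model.R")  ## fit the model on the stored data snapshot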

DATA

Data used to develop the model is stored in the ./DATA directory. Most of it comes from Chicago’s Open Data Portal. The following datasets were used in building the analysis-ready dataset:

  • Business Licenses
  • Food Inspections
  • Crime
  • Garbage Cart Complaints
  • Sanitation Complaints
  • Weather
  • Sanitarian Information

The data sources are joined to create a tabular dataset that paints a statistical picture of a ‘business license’, the primary modelling unit (unit of observation) in this project.

The data sources are joined (in a SQL-like manner) on appropriate composite keys. These keys include Inspection ID, Business License, and geography expressed as a latitude/longitude combination, among others.
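A minimal sketch of one such join, with illustrative column names rather than the project's exact schema:

library(data.table)

inspections <- data.table(Inspection_ID = 1:3,
                          License       = c(10L, 10L, 20L))
licenses    <- data.table(License       = c(10L, 20L),
                          Facility_Type = c("Restaurant", "Grocery Store"))

## Join the two sources on the shared License key
merge(inspections, licenses, by = "License")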

Acknowledgements

This research was conducted by the City of Chicago with support from the Civic Consulting Alliance and Allstate Insurance. The City would especially like to thank Stephen Collins, Gavin Smart, Ben Albright, and David Crippin for their efforts in developing the predictive model. We also appreciate the help of Kelsey Burr, Christian Hines, and Kiran Pookote in coordinating this research project. We owe special thanks to our volunteers from Allstate, who put in a tremendous effort to develop the predictive model, and to Allstate for allowing their team to volunteer for projects that change their city. This project was partially funded by an award from the Bloomberg Philanthropies' Mayors Challenge.

food-inspections-evaluation's People

Contributors

cash, fgregg, geneorama, mountmckinney, skishchampi, socratesk, thingdiputra, tomschenkjr


food-inspections-evaluation's Issues

Create CONTRIBUTING document

Create a CONTRIBUTING.md document that describes the minimum contributing guidelines for contributors. Some thoughts on potential content:

  • Outline that GitHub issues are the best way to submit an issue.
  • Outline the parameters we seek if someone is attempting to improve the model, such as describing how to show their model is effective.

Any other thoughts?

[Bug] Fix factor handling

We need to track the factor levels and make sure that they are consistent across different data partitions, and with out of sample data. Right now this isn't a bug, but it will be in production.

I actually ran into a problem last week that I thought was data.table, but it was actually just general "factor handling": Rdatatable/data.table#967

Here's a general example:
I run into these problems a lot, especially when predicting on out of sample data. For example, if a model had "AZ" "AR" "CA" "CO" in a column called "US_STATE", but the prediction data only has "CA" in "US_STATE", then CA will be coded to a factor of 1 if you don't also set the levels. In the model the factor level of 1 gets treated as AZ!
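A minimal sketch of the pitfall and its fix:

train_states <- factor(c("AZ", "AR", "CA", "CO"))
new_states   <- factor("CA")

as.integer(train_states)  ## CA codes to 3 (levels sort to AR, AZ, CA, CO)
as.integer(new_states)    ## CA codes to 1 -- a different variable entirely

## The fix: reuse the training levels when building out-of-sample factors
new_states_fixed <- factor("CA", levels = levels(train_states))
as.integer(new_states_fixed)  ## 3, consistent with training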

Incorporate data.table and new project structure

Refactor code to

  • Incorporate data.table throughout project for more efficient data management and handling
  • Incorporate convention of putting functions in separate files
  • Incorporate convention of creating a fresh initialization for each script
  • Incorporate convention of ordering scripts according to process
  • Move process scripts into functions
  • Match process output of new process to "DATA/recreated_training_data_20141103v02.Rdata"

Clean up facility type

Use this code to convert facility type to something that is more like the original (and is more manageable than the original)

## Tabulate the raw facility types by frequency
dat[ , .N, Facility_Type][order(-N)]
dat[ , .N, Facility_Type][order(-N), Facility_Type]

## Map the messy raw values onto a small set of clean categories
dat[grep("restaurant", Facility_Type, ignore.case = TRUE), Facility_Type_Clean := "Restaurant"]
dat[grep("grocery", Facility_Type, ignore.case = TRUE), Facility_Type_Clean := "Grocery Store"]
dat[is.na(Facility_Type_Clean), Facility_Type_Clean := "Other"]

## Check the cleaned distribution
dat[ , .N, Facility_Type_Clean][order(-N)]

Rewrite functions to generate data

Right now, several R files source several other R files that download and organize data. The dates are hardcoded, which makes it difficult to generate data over different date ranges. It would be more useful to change this into a function that takes user-defined parameters for the date range. For instance:

foodInspect <- liveReadInFoodInspections(start.date="%Y-%m-%d", end.date="%Y-%m-%d")

Several functions need to be written so we can pass a date parameter to retrieve data.

  • recreate_training_data.R
  • create_out-of-sample_data.R
  • food-inspections-evaluation.R
  • addWeather.R

This supersedes #3
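A hypothetical sketch of such a function, assuming the Socrata endpoint for the Food Inspections dataset (4ijn-s7e5) and the jsonlite package; the real implementation may differ:

library(jsonlite)
library(data.table)

liveReadInFoodInspections <- function(start.date, end.date) {
    ## Build a SoQL query filtering on the inspection date range
    url <- paste0("https://data.cityofchicago.org/resource/4ijn-s7e5.json",
                  "?$where=inspection_date between '", start.date,
                  "' and '", end.date, "'",
                  "&$limit=50000")
    as.data.table(fromJSON(URLencode(url)))
}

## Usage, with concrete dates in place of the format placeholders:
## foodInspect <- liveReadInFoodInspections("2014-01-01", "2014-12-31")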

Model diagnostics

Develop the ability to test new models and see diagnostic results in a GUI (Shiny).

Convert Latitude to numeric in the import step, rather than in the merge step

These lines belong in the import step:

## (THIS SHOULD HAPPEN IN THE 10 IMPORT STEP)
foodInspect[ , Latitude := as.numeric(Latitude)]
## FIX SANITATION LATITUDE
## (THIS SHOULD HAPPEN IN THE 10 IMPORT STEP)
sanitationComplaints[ , Latitude := as.numeric(Latitude)]
## FIX GARBAGE CART LATITUDE
## (THIS SHOULD HAPPEN IN THE 10 IMPORT STEP)
garbageCarts[ , Latitude := as.numeric(Latitude)]

Update README.md

Check that the README.md reflects the most recent steps, and is complete with the current process.

Remove unnecessary terminal outputs from 12_Merge.R

Update README file

Need to update this to include the relevant files, license (MIT), and execution instructions.

Cannot execute sqldf code on server

The recreate_training_data.R file uses sqldf to perform some SQL operations (joining and filtering). Since we're using R-3.0.2, the latest version of sqldf was incompatible (it requires R >= 3.1). I included scripts to manually install it in 5b52d79.

However, it appears we have some driver issues, which may be related to the nature of the install. Specifically, we receive this error message:

Error in validObject(.Object) : 
  invalid class “SQLiteDriver” object: invalid object for slot "Id" in class "SQLiteDriver": got class "integer", should be or extend class "externalptr"
Error in !dbPreExists : invalid argument type

I'll be switching to a Mac to do the intermediate work. If we cannot get sqldf() to work, we can use the merge() function.
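A minimal sketch of that fallback, with hypothetical table and column names, showing the same join via sqldf and via base merge():

library(sqldf)

inspections <- data.frame(LICENSE_NO = c(10, 10, 20),
                          RESULTS    = c("Pass", "Fail", "Pass"))
licenses    <- data.frame(LICENSE_NO    = c(10, 20),
                          FACILITY_TYPE = c("Restaurant", "Grocery Store"))

joined_sql <- sqldf("SELECT i.*, l.FACILITY_TYPE
                     FROM inspections i
                     JOIN licenses l ON i.LICENSE_NO = l.LICENSE_NO")

## Equivalent in base R, with no SQLite driver involved:
joined_base <- merge(inspections, licenses, by = "LICENSE_NO")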

Fix penalty in glmnet model

Currently the penalty in the model is ifelse(grepl("^Inspector.Assigned", colnames(mm)), 1, 0); however, the pattern no longer matches because the Inspector.Assigned variable name was changed some time ago before entering the model matrix, so the penalty vector needs to be updated to use the new name.
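For context, a runnable sketch of how such a penalty vector enters a glmnet fit (toy data; column names illustrative):

library(glmnet)

set.seed(1)
mm <- matrix(rnorm(600), ncol = 3,
             dimnames = list(NULL, c("Inspector.Assigned_blue",
                                     "Inspector.Assigned_green",
                                     "pastSerious")))
y <- rbinom(200, 1, 0.3)

## Columns matching the pattern get penalty 1 (subject to shrinkage);
## all other columns get 0 and are never shrunk out of the model.
pen <- ifelse(grepl("^Inspector.Assigned", colnames(mm)), 1, 0)
fit <- glmnet(mm, y, family = "binomial", penalty.factor = pen)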

Banner image does not display on iOS devices

Confirmed on iPad and iPhone. Appears to be a problem with the type of effect I'm using for parallax. Some documentation suggests jQuery handles this better, so I will explore those options.

Fix sloppy weather code in 12 merge

The weather data is read in at the start of the file, but then overwritten later on with a combination of weather and weather_new.

Although there is no problem with the functionality, this is bad practice and should be fixed. The right approach would be to keep the objects separate (e.g., use weather_old and weather_new, then combine them into weather), or at least combine them immediately after reading them in.
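A sketch of the suggested pattern, assuming hypothetical file names:

library(data.table)

weather_old <- fread("DATA/weather_old.csv")   ## hypothetical file names
weather_new <- fread("DATA/weather_new.csv")
weather     <- rbindlist(list(weather_old, weather_new))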

Test dev branch before merge to master

Code should be run from beginning to end to ensure it completes with no errors. The following components should be tested:

  • Generate training data
  • Generate out-of-sample data
  • Execute GLMnet code

Proposed edits to technical document

There are several changes and updates that I would like to propose for the document, so I am creating a new issue for that work (this branch will be a descendant of issue 47).

  • Reword the summary to focus on "days sooner" rather than percent improvement.
  • Add mentions of the model and software to the introduction
  • Add subsections
  • Fix typos
  • Update formulas
  • Experiment with citation representation
  • Add a citation for the MASS package
  • Describe the KDE method for heat variables

Location of technical article?

This is annoyingly late in the game, but I am proposing a conversation on where to place the article in the repo. Right now it's placed in the root. However, there are a number of files/artifacts for this article (as opposed to REPORTS).

.
├── (other files)
├── REPORTS/
├── article
|   ├── forecasting-restaurants-with-critical-violations-in-Chicago.Rmd (knitr doc)
|   ├── forecasting-restaurants-with-critical-violations-in-Chicago.pdf (PDF)
|   ├── forecasting-restaurants-with-critical-violations-in-Chicago.doc (Word)
|   ├── forecasting-restaurants-with-critical-violations-in-Chicago.html (Webpage)
|   ├── forecasting-restaurants-with-critical-violations-in-Chicago.bib (BibTeX of references)
|   └──  forecasting-restaurants-with-critical-violations-in-Chicago-citation.bib (PROPOSED: BibTeX of citation for what we write)

As you can see, I am pointing out the inclusion of the BibTeX file needed to generate the article's bibliography. I'm also proposing a BibTeX file that shows how to cite the article we have prepared.

Thoughts?

Change liveReadInFoodInspections.R to a function

Right now, other scripts source this file directly, and it conducts a number of operations when sourced. It would be more useful to change it into a function that takes user-defined parameters for the date range. For instance:

foodInspect <- liveReadInFoodInspections(start.date="%Y-%m-%d", end.date="%Y-%m-%d")

Streamline and package "data update" step

Right now this process assumes a manually created, single-point-in-time download.

We need a way to refresh / add new data. This should be accomplished by adding function(s) that encompass the entire data step.

It makes sense to tackle some other issues in this step:

  • Fix Weather import process (issue 32)
  • Convert latitude to numeric during import (issue 31)
  • Add "Cleaned" facility type to data (issue 30)
  • Merge logic (issue 22); this issue is almost a repeat of the current issue (38)
  • Remove old data files and put into top level DATA folder (new issue)

Correct typos in gh-pages

Fix the following issues:

  • Remove references to facility type and risk variables
  • Update bar chart with revised percentages

Mask food inspector identity

Find a way to represent the food inspector information without being individually identifiable, and without dramatically impacting model quality.
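One possible approach, sketched with toy data and hypothetical column names: replace each sanitarian's ID with an arbitrary anonymous label and drop the original identifier.

library(data.table)

set.seed(42)
inspections <- data.table(Inspection_ID = 1:4,
                          Inspector_ID  = c("smith", "jones", "smith", "lee"))

## Build a randomized lookup table, join it on, then drop the real ID
ids <- unique(inspections$Inspector_ID)
key <- data.table(Inspector_ID   = ids,
                  Inspector_Anon = paste0("inspector_", sample(seq_along(ids))))
inspections <- merge(inspections, key, by = "Inspector_ID")[ , Inspector_ID := NULL][]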

gh-pages nav bar taller than defined

Unsure why the navbar is appearing taller than defined in the CSS. See the screenshot below, showing the difference between the height defined in the CSS and the height rendered in the browser.
[screenshot, 2015-02-08]

Reproducible documentation that summarizes evaluation findings

The findings should be summarized in a reproducible document (e.g., knitr).

Let's use checkboxes to flag items that should be in the final document. Here are some of my items (revised on 2/3):

  • Reference model source in knitr document
  • Update percentage difference found in the first period between BAU and DDM
  • table of variables [2]
  • additional narrative on how we masked food inspectors from public model [3.1]
  • output of regression results (significant variables) [3.1]
  • "Graph of "days sooner that a critical violation would have been discovered" [4.1]
  • "Critical violations found on a daily basis as a percent of total inspections performed" [4.1]
  • "Bar plot of finding violations in in the first period by group (BAU v model)" [4.1]
  • "Cumulative Critical Violations Found BAU Versus Model" [4.1]
  • "Cumulative difference between violations found (BAU and model) and best-case-scenario" [4.1]

Supersedes issue #5

Fonts on different computers on gh-pages

It doesn't appear the League Spartan font is rendering correctly on all computers. Need to test further and fix this.

(I think it's just a problem on how the font is being referenced in the CSS).

[Request] Break out logic in 12_Merge step

The calculated variables should be put into separate functions or scripts, and the merge should just merge in the output from a function or the precalculated information produced by the script (see the sketch after the list below).

Basis for data: Food inspections

Likely candidates for new scripts:

  • Food inspection history
  • Associated business characteristics
  • Weather data
  • Heat data variables
  • Inspector information
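A sketch of the proposed structure, with hypothetical function names and toy data: each feature family is computed by its own function keyed on Inspection_ID, and the merge step reduces to a chain of joins.

library(data.table)

foodInspect <- data.table(Inspection_ID = 1:3, License = c(10L, 10L, 20L))

## Hypothetical feature builders, each returning one row per Inspection_ID
calc_history <- function(fi) {
    fi[ , .(Inspection_ID, prior_visits = seq_len(.N) - 1L), by = License
       ][ , License := NULL][]
}
calc_weather <- function(fi) {
    data.table(Inspection_ID = fi$Inspection_ID, heat_3day = c(71, 74, 80))
}

## The merge step just chains keyed joins over the builders' outputs
model_data <- Reduce(function(x, y) merge(x, y, by = "Inspection_ID"),
                     list(foodInspect, calc_history(foodInspect),
                          calc_weather(foodInspect)))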

Create graphs summarizing findings

A few graphs that should be created:

  • % of critical violations found vs % of inspections conducted (for both the actual and simulated results)
  • Number of critical violations by month for actual and simulated results
