Comments (9)
I encountered a slight wrinkle... I also need to change the model script in some way so that we can read in the canned weather data.
In the model we read in the weather from DATA/17_mongo_weather_update.Rds, which is a result of the script CODE/17_mongo_weather_update.R. The download script gets the weather data from Mongo of course, which is an internal system. (It's called "update" because it's the
In the evaluation we read the weather from DATA/weather_20110401_20141031.Rds.
Two solutions:
- Simply rename weather_20110401_20141031.Rds to 17_mongo_weather_update.Rds
- Update the mongo weather with some more recent weather data that reflects what we actually use in the model.
The weather data that we use in the model is similar to the data stored in the project, but it's not the same. I vaguely remember reconciling this at some point; the plots look familiar.
Here pi is the new precipitation intensity and tm is the new temperature max.
I'm leaning toward the first solution, i.e. renaming the file so that the data pipeline just works.
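A minimal sketch of the first option, assuming both files live in the DATA folder. The helper name `use_canned_weather` is hypothetical, and copying instead of renaming is an assumption made here so the original canned file stays in place. The demo runs in a temporary directory with made-up values, not the real weather data.

```r
# Hypothetical helper: make the canned weather file available under the
# name the model script expects. Copying (rather than renaming) is an
# assumption so that the original file is left untouched.
use_canned_weather <- function(data_dir = "DATA",
                               canned = "weather_20110401_20141031.Rds",
                               expected = "17_mongo_weather_update.Rds") {
  src <- file.path(data_dir, canned)
  dst <- file.path(data_dir, expected)
  if (!file.exists(src)) {
    stop("Canned weather file not found: ", src)
  }
  # overwrite = TRUE replaces any stale copy of the expected file
  ok <- file.copy(src, dst, overwrite = TRUE)
  if (!ok) stop("Could not copy ", src, " to ", dst)
  invisible(dst)
}

# Demo in a temporary directory with toy data
demo_dir <- file.path(tempdir(), "DATA")
dir.create(demo_dir, showWarnings = FALSE, recursive = TRUE)
toy_weather <- data.frame(date = as.Date("2011-04-01") + 0:2,
                          temperatureMax = c(53.5, 59.0, 56.2))
saveRDS(toy_weather, file.path(demo_dir, "weather_20110401_20141031.Rds"))

dst <- use_canned_weather(data_dir = demo_dir)
weather <- readRDS(dst)
```

With this in place the model script can keep reading 17_mongo_weather_update.Rds without knowing whether the file came from Mongo or from the canned data.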
from food-inspections-evaluation.
One more problem before I can make sure that the new model code works: the tobacco category changed from tobacco_retail_over_counter to tobacco, so again the canned data doesn't work in the model.
In this case I think it would be best to refresh the business license information.
Other solutions, like conditionally renaming the field / column header, are also possible, but I think they're more likely to lead to confusion and cause bugs.
The business license data has changed quite a bit since we stored it back in the first evaluation. There are more columns and (of course) many more records, so it's a much larger data set than before, but the content should be the same.
@tomschenkjr thoughts on this?
I just re-downloaded the business data.
Previously we had 470,994 records, now we have 923,834 records.
However, the number of business records that go into the model as features is the same after the download: 27,600. So it looks like the business data is consistent after doing the filtering, subsetting, and matching.
Note that the records on the data portal represent Licenses... but the 27,600 figure represents the licenses reshaped and summarized by business.
So after I make all these changes and refresh the business license data, the final xmat looks pretty similar to the original. The first few rows are identical and the number of rows in xmat goes from 18,712 to 18,781.
old xmat structure:
# > str(xmat)
# Classes ‘data.table’ and 'data.frame': 18712 obs. of 13 variables:
# $ Inspection_ID : num 269961 507211 507212 507216 507219 ...
# $ Inspector : chr "green" "blue" "blue" "blue" ...
# $ pastSerious : num 0 0 0 0 0 0 0 0 0 0 ...
# $ pastCritical : num 0 0 0 0 0 0 0 0 0 0 ...
# $ timeSinceLast : num 2 2 2 2 2 2 2 2 2 2 ...
# $ ageAtInspection : int 1 1 1 1 1 1 0 1 1 0 ...
# $ consumption_on_premises_incidental_activity: num 0 0 0 0 0 0 0 0 0 0 ...
# $ tobacco_retail_over_counter : num 1 0 0 0 0 0 0 0 0 0 ...
# $ temperatureMax : num 53.5 59 59 56.2 52.7 ...
# $ heat_burglary : num 26.99 13.98 12.61 35.91 9.53 ...
# $ heat_sanitation : num 37.75 15.41 8.32 38.19 2.13 ...
# $ heat_garbage : num 12.8 12.9 8 26.2 3.4 ...
# $ criticalFound : num 0 0 0 0 0 0 0 0 0 0 ...
# - attr(*, "sorted")= chr "Inspection_ID"
# - attr(*, ".internal.selfref")=<externalptr>
new xmat structure:
> str(xmat)
Classes ‘data.table’ and 'data.frame': 18781 obs. of 13 variables:
$ Inspection_ID : num 269961 507211 507212 507216 507219 ...
$ criticalFound : num 0 0 0 0 0 0 0 0 0 0 ...
$ pastSerious : num 0 0 0 0 0 0 0 0 0 0 ...
$ ageAtInspection : int 1 1 1 1 1 1 0 1 1 0 ...
$ pastCritical : num 0 0 0 0 0 0 0 0 0 0 ...
$ consumption_on_premises_incidental_activity: num 0 0 0 0 0 0 0 0 0 0 ...
$ tobacco : num 1 0 0 0 0 0 0 0 0 0 ...
$ temperatureMax : num 53.5 59 59 56.2 52.7 ...
$ heat_burglary : num 26.99 13.98 12.61 35.91 9.53 ...
$ heat_sanitation : num 37.75 15.41 8.32 38.19 2.13 ...
$ heat_garbage : num 12.8 12.9 8 26.2 3.4 ...
$ Inspector : chr "green" "blue" "blue" "blue" ...
$ timeSinceLast : num 2 2 2 2 2 2 2 2 2 2 ...
- attr(*, "sorted")= chr "Inspection_ID"
- attr(*, ".internal.selfref")=<externalptr>
Obviously looking at the structure isn't very detailed, but I'll check out the model in a minute.
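A more detailed check than `str` could compare the two matrices value by value. The sketch below is only an illustration on made-up data frames (the object names `xmat_old` and `xmat_new` and all values are assumptions). It handles the two differences visible in the structures above: the renamed tobacco column and the different column order.

```r
# Toy stand-ins for the old and new xmat (values are made up)
xmat_old <- data.frame(Inspection_ID = c(1, 2, 3),
                       Inspector = c("green", "blue", "blue"),
                       tobacco_retail_over_counter = c(1, 0, 0),
                       criticalFound = c(0, 0, 1),
                       stringsAsFactors = FALSE)
xmat_new <- data.frame(Inspection_ID = c(1, 2, 3, 4),
                       criticalFound = c(0, 0, 1, 0),
                       tobacco = c(1, 0, 0, 0),
                       Inspector = c("green", "blue", "blue", "red"),
                       stringsAsFactors = FALSE)

# Align the column name for the comparison only; stored data is not modified
old_cmp <- xmat_old
names(old_cmp)[names(old_cmp) == "tobacco_retail_over_counter"] <- "tobacco"

# Compare shared inspections with columns and rows in the same order
shared_ids <- intersect(old_cmp$Inspection_ID, xmat_new$Inspection_ID)
cols <- sort(names(old_cmp))
old_sub <- old_cmp[old_cmp$Inspection_ID %in% shared_ids, cols]
new_sub <- xmat_new[xmat_new$Inspection_ID %in% shared_ids, cols]
old_sub <- old_sub[order(old_sub$Inspection_ID), ]
new_sub <- new_sub[order(new_sub$Inspection_ID), ]
rownames(old_sub) <- NULL
rownames(new_sub) <- NULL

# TRUE when every shared row matches; IDs that only exist in the new data
same_values <- isTRUE(all.equal(old_sub, new_sub))
only_in_new <- setdiff(xmat_new$Inspection_ID, xmat_old$Inspection_ID)
```

Here `same_values` reports whether the overlapping inspections agree, and `only_in_new` lists the inspections that account for the extra rows.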
Actually it's going to take a minute to do an apples to apples comparison. The production model doesn't split the test / train data, because in production we build the model on everything.
It looks like the model is producing very similar results, but I'm not ready to push up the draft of the evaluation.
The gini on the test data is 34.6%, the previous number was 34.5%... so the results look comparable.
I mostly finished up the split on Wednesday, then I noticed a few things that needed attention yesterday and I fixed those and pushed up a big commit. I'm testing it now with the downstream prediction step.
It was tricky to find a clean way to run the production and evaluation models with the same code. I solved this by running the model twice, once with all the data and once with all the data except the past 90 days. This works because the evaluation data has that big gap between the test / train periods, so we didn't need to be explicit about the exact start and end of the experiment; it's implicitly defined by the availability of the data.
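The two-run idea can be sketched roughly as follows. This is not the project's code: it uses a plain `glm` as a stand-in for the real model, simulated data, a hypothetical helper `make_test_index`, and it assumes "the past 90 days" is measured back from the latest date in the data.

```r
set.seed(1)

# Simulated inspections: one per day for 300 days (made-up data)
dat <- data.frame(Inspection_Date = as.Date("2014-01-01") + 0:299,
                  x = rnorm(300))
dat$criticalFound <- rbinom(300, 1, plogis(dat$x))

# Hypothetical helper: TRUE for rows in the final `holdout_days` of the data
make_test_index <- function(dates, holdout_days = 90) {
  dates > (max(dates) - holdout_days)
}

# Stand-in model; the same fitting function serves both runs
fit_model <- function(d) {
  glm(criticalFound ~ x, data = d, family = binomial())
}

is_test <- make_test_index(dat$Inspection_Date)

# Run 1: "production" fit on everything
model_production <- fit_model(dat)

# Run 2: "evaluation" fit on everything except the held-out window
model_evaluation <- fit_model(dat[!is_test, ])
pred_test <- predict(model_evaluation,
                     newdata = dat[is_test, ],
                     type = "response")
```

Because the split is derived from the dates present in the data, the same script works for both cases; with the evaluation data, the gap between periods decides which rows end up in the test window.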
I also had to reorganize the workflow a bit to accommodate the test / train index management. Previously we could just take out the NA values right before we fit the model, but now we need to be more careful, or else the matrix would have different rows than the source data frame, which is where the test / train index is stored. Obviously there are lots of ways to solve this sort of thing; I tried to choose something that kept the code easy to follow and audit.
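One way to keep the index and the matrix aligned, shown here only as a sketch on made-up data, is to drop incomplete rows from the source data frame first and build both the matrix and the test / train index from that filtered frame. The `Test` column name and the feature list are assumptions for the example.

```r
# Made-up source data with a stored test / train flag and some NA values
dat <- data.frame(Inspection_ID = 1:6,
                  pastSerious = c(0, 1, NA, 0, 2, 0),
                  timeSinceLast = c(2, 2, 1.5, NA, 2, 0.5),
                  criticalFound = c(0, 1, 0, 0, 1, 0),
                  Test = c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE))

feature_cols <- c("pastSerious", "timeSinceLast")

# Filter the data frame itself, so the index is filtered along with it
keep <- complete.cases(dat[, feature_cols])
dat_model <- dat[keep, ]

# Matrix and index now come from the same rows, in the same order
xmat <- as.matrix(dat_model[, feature_cols])
i_test <- dat_model$Test

x_train <- xmat[!i_test, , drop = FALSE]
x_test <- xmat[i_test, , drop = FALSE]
```

Removing the NA rows once, up front, means there is a single place to audit when row counts look off.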