drivendataorg / cyfi
Estimate cyanobacteria density based on Sentinel-2 satellite imagery
Home Page: https://cyfi.drivendata.org/
License: MIT License
HISTORY.md
examples folder
Add macos-latest to the os list in tests.yml. See this slack thread.
As part of this work, we want to have some deliverables [pages / slides / blog post / demo] that show how this can be used.
Starting this issue to capture ideas:
Format could be things like:
Could be set up either as narrative storytelling or a demo video. Could potentially live on a beta page (through NASA).
See if we can make PC search more efficient (pc_meters_search_window). The current number for pc_meters_search_window is taken directly from the third place solution. From experimentation during competition prep, we got more results that include the given point when we use a large bounding box (which is a bit odd).
Decide how we want to handle cases where:
For samples with no imagery, third place predicts the average of all predicted severities in a sample's given region. We can't do this consistently in a production context.
Note that LightGBM is able to handle missing values as input.
Change the package to be able to handle s3 paths, and load from / save to s3 locations.
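One possible approach, sketched with cloudpathlib's AnyPath, which dispatches between local and s3:// paths (adding cloudpathlib as a dependency is an assumption, and the function names here are illustrative):

```python
import pandas as pd
from cloudpathlib import AnyPath


def load_samples(samples_path: str) -> pd.DataFrame:
    # AnyPath resolves to a pathlib.Path for local paths and an S3Path
    # for s3:// URIs, so the same code handles both
    return pd.read_csv(AnyPath(samples_path).open("r"))


def save_preds(preds: pd.DataFrame, output_path: str) -> None:
    # open("w") on a cloud path uploads the file when the handle is closed
    with AnyPath(output_path).open("w") as f:
        preds.to_csv(f, index=False)
```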
For a useful example of a simple README, see: https://github.com/drivendataorg/repro-zipfile
test_generate_candidate_metadata
In the final package, we will assume that planetary computer search results will always be regenerated
In preds.csv, include all samples and have NaNs for samples with no features.
In results.json, also include the number and percent of samples that we did not predict on.
Right now we have test assets for evaluate_data.csv, train_data.csv, and predict_data.csv. See if we can consolidate into fewer. E.g., we may be able to have one longer file (~20 rows) and then divvy it up into train/test files as needed.
Get a very simple experiment to run end-to-end
Rather than hitting planetary computer APIs in the tests, mock the results from the API so our tests do not fail as a result of something API-related (e.g. too many hits, internet issues, etc.).
Consider having one test that does hit the API, so that only that test fails when there is an API problem.
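A minimal sketch of the mocking approach, assuming pytest and unittest.mock; the patched module path, fixture names, and generate_features function are illustrative, not the package's actual API:

```python
from unittest import mock


def test_features_without_hitting_api(sample_points, saved_search_results):
    # Patch the (hypothetical) search function so no network call happens
    with mock.patch(
        "cyfi.data.satellite_data.search_planetary_computer",
        return_value=saved_search_results,
    ) as mocked_search:
        features = generate_features(sample_points)

    mocked_search.assert_called_once()
    assert len(features) == len(sample_points)
```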
Use past PC search results data if available
Modify the satellite querying process in the package to match how we'll need to pull in data from our big PC search for all competition data
Move code from cli.py into experiment.py
train_test_split.py
Note in README.md that there is an unsupported experiment module for training new models.
Final pieces of #41 (training with folds)
Within the cache directory, cache to a folder specified by the hash of the features config. Then if we have relevant previously saved imagery, we can use the existing imagery
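A minimal sketch of hashing the features config to get a cache subfolder, assuming the config can be serialized to a dict (function and argument names are illustrative):

```python
import hashlib
import json
from pathlib import Path


def features_cache_dir(cache_dir: Path, features_config: dict) -> Path:
    # Hash a canonical JSON serialization so the same config always maps
    # to the same folder, regardless of key order
    config_hash = hashlib.md5(
        json.dumps(features_config, sort_keys=True).encode()
    ).hexdigest()
    return cache_dir / config_hash
```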
We want a user-friendly way to peruse the test set predictions on imagery.
Right now we've got a notebook with some examples, and it shouldn't be that much more work to put this into a gradio app. We can precompute the image links for each of the test set points so the app just needs to load and render a specified box.
Gradio demo code: https://huggingface.co/spaces/gradio/xgboost-income-prediction-with-explainability/blob/main/app.py
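A rough sketch of such an app, assuming we precompute a CSV of image links and predictions (the file name and column schema are illustrative):

```python
import gradio as gr
import pandas as pd

# Precomputed test set points with image links and predictions (illustrative schema)
points = pd.read_csv("test_set_image_links.csv").set_index("sample_id")


def show_point(sample_id: str):
    row = points.loc[sample_id]
    caption = f"Actual severity: {row.severity} / Predicted: {row.predicted_severity}"
    return row.image_url, caption  # gradio can render an image from a URL string


demo = gr.Interface(
    fn=show_point,
    inputs=gr.Dropdown(choices=list(points.index), label="Test set sample"),
    outputs=[gr.Image(label="Sentinel-2 box"), gr.Textbox(label="Prediction")],
)
demo.launch()
```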
With our updated broader severity bins, regenerate our best model evaluation files (s3://drivendata-competition-nasa-cyanobacteria/experiments/results/best_model) so the metrics will be comparable to any new models trained.
Our package draws from results generated by code in the competition repo
needs #73
(sentinel_200)
Our logging was written mostly with experimentation in mind. Go through what the logs will look like for a user, and:
Before we can run any experiments, we need to reproduce one winner's submission. This achieves the following:
I'd suggest starting with the predict pipeline for simplicity, but we'll need to have both for experiments.
We could add an advanced use page that outlines how to:
python cyfi/experiment.py --help
cyfi predict --model-path...
It may be worth using a real experiment-tracking tool for this. Let's spend an hour or two looking at options and seeing if they're a good fit.
Right now, extra fields are ignored if passed in. This means that if a user has a typo in something like "features_config" in their yaml, they'll get the defaults rather than what they specified, and it will happen silently.
Actions:
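One possible action: reject unknown fields at config parse time. A minimal sketch, assuming the configs are pydantic models (pydantic v2 syntax; the fields and defaults shown are illustrative):

```python
from pydantic import BaseModel, ConfigDict


class FeaturesConfig(BaseModel):
    # extra="forbid" makes pydantic raise a ValidationError on unrecognized
    # fields, so a misspelled option fails loudly instead of silently
    # falling back to defaults
    model_config = ConfigDict(extra="forbid")

    pc_meters_search_window: int = 1000  # illustrative default
    image_feature_meter_window: int = 500  # illustrative default
```

Applying the same setting on the top-level config model would also catch a typo in the "features_config" key itself.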
If the input csv contains duplicate sample points (location/date combos), we should just predict the same value for each of them (i.e. treat them as independent observations). Currently the code errors.
Pathway to error
This .loc call means we can end up with a dataframe with multiple rows rather than a series here.
And then sample.latitude is not a single value.
Which causes this to error.
Suggested implementation
This code should just be iterating over rows, rather than loc-ing to get the row.
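A minimal sketch of the suggested fix (predict_one is a hypothetical stand-in for the per-sample prediction step):

```python
import pandas as pd

samples = pd.read_csv("samples.csv")  # may contain duplicate location/date combos

for idx, sample in samples.iterrows():
    # iterrows always yields a Series per row, even when another row shares
    # the same latitude/longitude/date, so sample.latitude is a single scalar
    predict_one(sample.latitude, sample.longitude, sample.date)
```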
The info we need to tell whether we already have the array we need is:
Within the cache dir, we can create a directory with consistent naming that indicates meter bounding box. Within that, the file structure can identify specific item / sample combos.
Then we'll be able to construct a specific path, and check whether it already exists before downloading
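A minimal sketch of that layout (the naming scheme and the download_satellite_data call are illustrative):

```python
from pathlib import Path


def cached_array_path(
    cache_dir: Path, meters_window: int, item_id: str, sample_id: str
) -> Path:
    # e.g. cache/sentinel_200/<item_id>/<sample_id>.npy
    return cache_dir / f"sentinel_{meters_window}" / item_id / f"{sample_id}.npy"


path = cached_array_path(Path("cache"), 200, "S2A_MSIL2A_item", "sample_abc")
if not path.exists():
    download_satellite_data(path)  # hypothetical download step
```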
Since we're not using setuptools, we may not want to use a MANIFEST.in to include the model asset in the source distribution. Check that we are doing this correctly.
See slack thread for details
Change CLI options to be more user friendly.
In predict, have arguments / options for (see the sketch after this list):
samples_path (as is)
model_path (as is)
output_file: file path to save the predictions
output_directory: if this is specified, output_file will be relative to output_directory. Defaults to the current directory.
keep_features: if true, we'll copy the train and test features to the output_directory.
overwrite: just check whether the prediction path exists, and ask about overwriting predictions. Don't check about overwriting train / test features.
In evaluate, add an overwrite flag as well.
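A sketch of what the predict signature could look like, assuming the CLI is built with typer (defaults and help text here are illustrative):

```python
from pathlib import Path
from typing import Optional

import typer

app = typer.Typer()


@app.command()
def predict(
    samples_path: Path = typer.Argument(..., help="CSV of sample points"),
    model_path: Optional[Path] = typer.Option(None, help="Trained model to use"),
    output_file: Path = typer.Option(Path("preds.csv"), help="Where to save predictions"),
    output_directory: Path = typer.Option(Path.cwd(), help="output_file is relative to this"),
    keep_features: bool = typer.Option(False, help="Also save train/test features"),
    overwrite: bool = typer.Option(False, help="Overwrite existing predictions"),
):
    preds_path = output_directory / output_file
    # Only the prediction path triggers an overwrite prompt, per the list above
    if preds_path.exists() and not overwrite:
        typer.confirm(f"{preds_path} exists. Overwrite?", abort=True)
    ...
```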
Test training with k-folds. Likely best to emulate third place's methodology.
See past work here
For some point/item combos, we have a satellite tile but the bounding box contains entirely no-data pixels. We should:
This is in line with not using/predicting samples for which there is no imagery.
We can identify these as rows in the satellite data where the values are 0 for all satellite bands.
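A minimal sketch of that filter (the band column names are illustrative):

```python
import pandas as pd


def drop_all_nodata_rows(satellite_data: pd.DataFrame, band_cols: list[str]) -> pd.DataFrame:
    # Rows where every band value is 0 indicate a bounding box of no-data pixels
    all_nodata = (satellite_data[band_cols] == 0).all(axis=1)
    return satellite_data[~all_nodata]
```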
Download the land cover map (s3://drivendata-public-assets/land_cover_map.tar.gz) and unzip.
Allowing a user to pass in a lat, lon, and date instead of a csv to try out a prediction on a single point is a nice-to-have.
predict_point: in the separate command, the output path can be optional -- we can just print to the CLI if no output path is specified.
ModelTrainingConfig: also save it out with pipeline._to_disk, and use that to determine the target columns. Rename to ModelConfig, and make it a required input to CyanoModelPipeline.
Set pipeline.target_col = model_config.target_col (in addition to setting pipeline.model_config).
Update experiment.py to remove target_col (now part of the model config).
pipeline._predict_model
Rename density_metrics to log_density_metrics, and if we only have exact density, convert to log density before generating. Still output plots, possibly with exact density but on a log scale.
needs #6
I tried passing in a point from Google Maps for the SF Bay and got an error:
cyano predict-point -lat -122.3753 -lon 37.763987
...
ValueError: Latitude must be in the [-90; 90] range.
Which reminded me that we only support one CRS and don't specify what that is.
Minimum:
Nice to have:
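For the minimum, we could at least document that coordinates must be WGS84 decimal degrees and validate the input ranges. A minimal sketch (the swapped-coordinate hint is an assumption, not current behavior):

```python
def validate_point(latitude: float, longitude: float) -> None:
    if not -90 <= latitude <= 90:
        hint = ""
        # If swapping the two values would be valid, the user likely
        # passed them in (lon, lat) order
        if -90 <= longitude <= 90 and -180 <= latitude <= 180:
            hint = " Did you swap latitude and longitude?"
        raise ValueError(
            f"Latitude must be in the [-90, 90] range (got {latitude})."
            f" Coordinates must be WGS84 decimal degrees.{hint}"
        )
    if not -180 <= longitude <= 180:
        raise ValueError(f"Longitude must be in the [-180, 180] range (got {longitude}).")
```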
Some sample / item combinations raise a rioxarray.exceptions.NoDataInBounds error. In some cases, this happens even when the bounding box used is within the bounds of the imagery.
satellite_data.py/download_satellite_data
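A minimal sketch of handling this defensively, assuming the download clips with rioxarray's clip_box (the function shape is illustrative):

```python
import rioxarray
from rioxarray.exceptions import NoDataInBounds


def clip_sample_box(image_url: str, minx: float, miny: float, maxx: float, maxy: float):
    data = rioxarray.open_rasterio(image_url)
    try:
        return data.rio.clip_box(minx=minx, miny=miny, maxx=maxx, maxy=maxy)
    except NoDataInBounds:
        # Treat like a missing image: skip this sample/item combination
        return None
```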
related to #22
based on #9
Try to implement MLflow for experiment tracking.
If we don't use MLFlow, just create a google doc or google slides to document the conclusion / decision from each experiment. E.g. what are we freezing moving forward based on the experiment results?
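A minimal sketch of what MLflow tracking could look like here (the param/metric names and the train_and_evaluate entry point are illustrative):

```python
import mlflow

with mlflow.start_run(run_name="sentinel-window-200"):
    # Record the experiment config so runs are comparable later
    mlflow.log_params({"pc_meters_search_window": 200, "model": "lightgbm"})
    rmse = train_and_evaluate()  # hypothetical experiment entry point
    mlflow.log_metric("rmse", rmse)
    mlflow.log_artifact("metrics/results.json")
```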
One of the things we know NOAA will do is some spot checks, e.g. looking at imagery and seeing if predictions align and looking at temporal consistency.
As a first pass, let's look at temporal consistency.
needs #54