drivendataorg / cyfi

Estimate cyanobacteria density based on Sentinel-2 satellite imagery

Home Page: https://cyfi.drivendata.org/

License: MIT License

Languages: Makefile 3.31%, Python 96.69%
Topics: cyanobacteria, habs, satellite-imagery, sentinel-2

cyfi's People

Contributors

ejm714, jayqi, klwetstone


cyfi's Issues

Cut package release

  • Make sure README is fully updated
  • Update HISTORY.md
  • Make sure docs are fully set up
  • Either update or remove the examples folder
  • Make sure package name is updated everywhere
  • Add GitHub workflows to release to PyPI
  • Add back in macos-latest to the os list in tests.yml. See this slack thread

Placeholder issue for user-centered outputs

As part of this work, we want to have some deliverables [pages / slides / blog post / demo] that show how this can be used.

Starting this issue to capture ideas:

  • rank regions of concern
  • overlay predictions on a map to visualize
  • testimonial from a state (e.g. California)
  • metrics on presence / absence

Format could be things like:

  • slide deck
  • blog post
  • page in docs

Could be set up either as narrative storytelling or a demo video. Could potentially live on the beta page (through NASA).

Experiment with making PC search more efficient

See if we can make PC search more efficient

  • test using smaller values for pc_meters_search_window. The current number for pc_meters_search_window is taken directly from the third place solution
  • test using intersects instead of bounding box in our search

From experimentation during competition prep, we got more results that include the given point when we use a large bounding box (which is a bit odd).
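A rough sketch of what the two search styles could look like against the Planetary Computer STAC API via pystac-client; the 50 km window, collection name, point, and date range here are placeholders, not the package's actual query code:

    import planetary_computer
    from pystac_client import Client

    catalog = Client.open(
        "https://planetarycomputer.microsoft.com/api/stac/v1",
        modifier=planetary_computer.sign_inplace,
    )

    lat, lon = 41.42, -73.21  # hypothetical sample point
    date_range = "2021-07-01/2021-08-15"

    # Current style: bounding box around the point (window converted roughly to degrees)
    meters = 50_000
    deg = meters / 111_320  # approx. meters per degree of latitude
    bbox_search = catalog.search(
        collections=["sentinel-2-l2a"],
        bbox=[lon - deg, lat - deg, lon + deg, lat + deg],
        datetime=date_range,
    )

    # Alternative to test: only items whose footprint actually contains the point
    point_search = catalog.search(
        collections=["sentinel-2-l2a"],
        intersects={"type": "Point", "coordinates": [lon, lat]},
        datetime=date_range,
    )

    print(len(bbox_search.item_collection()), len(point_search.item_collection()))

Comparing the two result sets for a handful of points would also help explain why the large bounding box returns more items that include the point.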

Determine how to handle missing satellite imagery

Decide how we want to handle cases where:

  • No imagery is found for a sample. Currently we fill imagery features with the mean over all rows, but we probably want to do something more sophisticated
  • An item is selected for a sample in the metadata, but no imagery is downloaded for the item/sample
  • An item is selected for a sample in the metadata, and only some of the required bands are successfully downloaded. Currently, none of the item's imagery is used in this case

For samples with no imagery, third place predicts the average of all predicted severities in a sample's given region. We can't do this consistently in a production context.

Note that LightGBM is able to handle missing values as input.
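As a point of reference, a minimal illustration (not the package's training code) that LightGBM accepts NaN feature values directly, so missing-imagery rows could be left as NaN rather than mean-filled:

    import numpy as np
    import lightgbm as lgb

    X = np.array([[0.1, 0.5], [np.nan, 0.4], [0.3, np.nan], [0.2, 0.6]] * 10)
    y = np.array([1.0, 2.0, 1.5, 3.0] * 10)

    # use_missing is on by default; shown here for emphasis
    model = lgb.LGBMRegressor(n_estimators=10, use_missing=True)
    model.fit(X, y)
    print(model.predict(np.array([[np.nan, 0.5]])))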

Set severity levels based on WHO guidance

Rather than using the severity levels from Tick Tick Bloom, let's follow WHO guidance.


Very-high has proven to be quite rare, so low / med / high should be sufficient. Users can always bin differently given that we output density.
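A minimal sketch of what the binning could look like, assuming cutoffs of 20,000 and 100,000 cells/mL in the spirit of WHO recreational water guidance (the exact thresholds and labels would need to be confirmed against the guidance itself):

    import pandas as pd

    density = pd.Series([5_000, 45_000, 250_000, 1_200_000], name="density_cells_per_ml")
    severity = pd.cut(
        density,
        bins=[0, 20_000, 100_000, float("inf")],  # assumed WHO-style cutoffs
        labels=["low", "moderate", "high"],
    )
    print(pd.concat([density, severity.rename("severity")], axis=1))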

Remove functionality to use past PC search in package

  • Before releasing as a package, remove the default behavior in this PR that uses past planetary computer search results for all competition data.
  • Remove tests as needed, e.g. test_generate_candidate_metadata

In the final package, we will assume that planetary computer search results will always be regenerated

Only evaluate experiments on data with features

  • Only predict on samples with at least one non-metadata feature
  • Only evaluate on samples with at least one non-metadata feature
  • In preds.csv, include all samples and have nans for samples with no features
  • In results.json, also include the number and percent of samples that we did not predict on

Consolidate test assets

Right now we have test assets for evaluate_data.csv, train_data.csv, and predict_data.csv. See if we can consolidate into fewer. E.g. we may be able to have a longer file (~20 rows) and then divvy it up into train/test files as needed

Mock calls to APIs in tests

Rather than hitting planetary computer APIs in the tests, mock the results from the API so our tests do not fail as a result of something API-related (e.g. too many hits, internet issues, etc.)

Consider having one test that does hit the API, so that only that test fails if there is an API-related problem
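A sketch of what the mocking could look like with unittest.mock; search_planetary_computer is a hypothetical stand-in for whichever function in cyano/data/satellite_data.py actually issues the STAC query:

    from unittest.mock import patch

    import pandas as pd

    FAKE_RESULTS = pd.DataFrame(
        {"item_id": ["S2A_fake_item"], "cloud_cover": [3.2], "days_before_sample": [5]}
    )


    def test_metadata_generation_is_offline():
        with patch(
            "cyano.data.satellite_data.search_planetary_computer",
            return_value=FAKE_RESULTS,
        ) as mock_search:
            # call the code under test here; it should consume FAKE_RESULTS
            # instead of hitting the network, and assertions go against mock_search
            ...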

Use past PC results data if available

  • Use past PC search results data if available

  • Modify the satellite querying process in the package to match how we'll need to pull in data from our big PC search for all competition data

Update `cyano/experiment` folder for package

  • Isolate experimentation to a single file
  • Move the CLI command from cli.py into experiment.py
  • Remove past configs and train_test_split.py
  • Add note in README.md that there is an unsupported experiment module for training new models

Finalize folds implementation

Final pieces of #41 (training with folds)

  • Update evaluation code to get feature importances
  • If someone doesn't have region and n_folds > 1, warn when doing model training and train without folds. Change _prep_train_data so that we don't error if region is not provided and n_folds > 1
  • Also warn if someone has n_folds > 1 and fewer samples than n_folds
  • change if / else order in training so that first we check whether we're training with multiple folds

Cache with feature options hash

Within the cache directory, cache to a folder specified by the hash of the features config. Then if we have relevant previously saved imagery, we can use the existing imagery
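A minimal sketch of keying the cache on a hash of the features config (the config fields shown are assumptions, not the package's real schema):

    import hashlib
    import json
    from pathlib import Path

    features_config = {"image_feature_meter_window": 500, "satellite_image_features": ["B02", "B03"]}

    # hash the config so runs with identical feature settings share a cache folder
    config_hash = hashlib.md5(
        json.dumps(features_config, sort_keys=True).encode()
    ).hexdigest()

    cache_dir = Path("~/.cache/cyano").expanduser() / config_hash
    cache_dir.mkdir(parents=True, exist_ok=True)
    print(cache_dir)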

Create gradio app with test set points and predictions on imagery

We want a user-friendly way to peruse the test set predictions on imagery.

Right now we've got a notebook with some examples and it shouldn't be that much more work to put this into a gradio app. We can precompute the image links for each of the test set points so the app just needs to load and render a specified box.


Gradio demo code: https://huggingface.co/spaces/gradio/xgboost-income-prediction-with-explainability/blob/main/app.py
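A minimal sketch of the gradio side, assuming a precomputed CSV with a sample ID, an image path, and a prediction per test point (the file and column names are hypothetical):

    import gradio as gr
    import pandas as pd

    points = pd.read_csv("test_set_points.csv")  # hypothetical: sample_id, image_path, severity


    def show_point(sample_id: str):
        # look up the precomputed chip and prediction for the chosen test point
        row = points.set_index("sample_id").loc[sample_id]
        return row["image_path"], f"Predicted severity: {row['severity']}"


    demo = gr.Interface(
        fn=show_point,
        inputs=gr.Dropdown(choices=points["sample_id"].tolist(), label="Test point"),
        outputs=[gr.Image(label="Sentinel-2 chip"), gr.Textbox(label="Prediction")],
    )

    if __name__ == "__main__":
        demo.launch()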

Retrain best model with severity bins

With our updated broader severity bins, regenerate our best model evaluation files (s3://drivendata-competition-nasa-cyanobacteria/experiments/results/best_model) so the metrics will be comparable to any new models trained

Add tests of full PC search results

Our package draws from results generated by code in the competition repo

  • Add more checks to make sure the results are correct. E.g. this could look like getting the metadata for 100 points and comparing those rows to the relevant ones generated from this code.

Make logging more intuitive for users

Our logging was written mostly with experimentation in mind. Go through what the logs will look like for a user, and:

  • Make sure all log messages are easy to interpret
  • Consider changing what is logged so key pieces of information are easier to pick out
  • Remove progress bars when not relevant

Reproduce one winning submission in our code package

Before we can run any experiments, we need to reproduce one winner's submission. This achieves the following:

  • ensures our refactored code is working as expected
  • confirms our pipes are hooked up properly
  • gives us a baseline against which to compare experiments

I'd suggest starting with the predict pipeline for simplicity but we'll need to have both for experiments.

Add advanced use docs page

We could add an advanced use page that outlines how to:

  • train your own model and specify different configuration options (features config, model config, experiment config)
  • run an experiment with python cyfi/experiment.py --help
  • use a custom model to predict at the command line cyfi predict --model-path...

Look into experiment-tracking tools

It may be worth using a real experiment-tracking tool for this. Let's spend an hour or two looking at options and seeing if they're a good fit.

  • Weights & Biases
  • Neptune
  • MLflow
  • Comet
  • DVC Studio

Do not allow extra fields on config models

Right now extra fields are ignored if passed in. This means if a user has a typo in something like "features_config" in their yaml, they'll get the defaults rather than what they specified and it will happen silently.

Actions:

  • do not allow extra fields (look up the right way to do this in pydantic v2)
  • check in the config tests that we get an error if extra fields are passed in (see the sketch below)
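For reference, in pydantic v2 this is extra="forbid" on model_config; a sketch (the field name is illustrative):

    from pydantic import BaseModel, ConfigDict, ValidationError


    class FeaturesConfig(BaseModel):
        model_config = ConfigDict(extra="forbid")

        image_feature_meter_window: int = 500


    try:
        FeaturesConfig(image_feature_meter_widnow=200)  # typo'd field name
    except ValidationError as e:
        print(e)  # validation error reports that extra inputs are not permitted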

Duplicate sample UIDs will cause an error

If the input csv contains duplicate sample points (location/date combos), we should just predict the same value for each of them (i.e. treat them as independent observations). Currently the code errors.

Pathway to error

This `loc` call means we can end up with a DataFrame with multiple rows rather than a Series here.

https://github.com/drivendataorg/cyanobacteria-prediction/blob/689f1200adef7ceea943b5018152550af607286a/cyano/data/satellite_data.py#L374

And then sample.latitude is not a single value

https://github.com/drivendataorg/cyanobacteria-prediction/blob/689f1200adef7ceea943b5018152550af607286a/cyano/data/satellite_data.py#L380

Which causes this to error

https://github.com/drivendataorg/cyanobacteria-prediction/blob/689f1200adef7ceea943b5018152550af607286a/cyano/data/satellite_data.py#L38

Suggested implementation

This code should just be iterating over rows, rather than loc-ing to get the row.

https://github.com/drivendataorg/cyanobacteria-prediction/blob/689f1200adef7ceea943b5018152550af607286a/cyano/data/satellite_data.py#L437-L449
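A sketch of the direction of the fix: iterate over the rows of the samples frame directly so duplicate location/date combinations are each handled independently (column names and the surrounding logic are illustrative, not the package's code):

    import pandas as pd

    samples = pd.DataFrame(
        {
            "sample_id": ["abc123", "abc123"],  # duplicate location/date combo
            "latitude": [41.42, 41.42],
            "longitude": [-73.21, -73.21],
            "date": ["2021-08-01", "2021-08-01"],
        }
    )

    for sample in samples.itertuples():
        # each row is processed on its own, even if sample_id repeats
        print(sample.sample_id, sample.latitude, sample.longitude)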

Use locally saved imagery if available

The info we need to tell whether we already have the array we need:

  • item ID
  • sample ID (hash)
  • bounding box size in meters

Within the cache dir, we can create a directory with consistent naming that indicates the bounding box size in meters. Within that, the file structure can identify specific item / sample combos.

Then we'll be able to construct a specific path, and check whether it already exists before downloading
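A sketch of what that path construction could look like (directory layout and file names are illustrative):

    from pathlib import Path


    def imagery_cache_path(cache_dir: Path, meter_window: int, item_id: str, sample_id: str) -> Path:
        # one folder per bounding box size, then one array per item/sample combo
        return cache_dir / f"bbox_{meter_window}m" / item_id / f"{sample_id}.npy"


    path = imagery_cache_path(
        Path("~/.cache/cyano").expanduser(), 500, "S2A_MSIL2A_20210801_fake", "abc123"
    )
    if path.exists():
        print(f"Reusing cached imagery at {path}")
    else:
        print(f"Not cached yet; download and save to {path}")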

Update CLI options

Change CLI options to be more user friendly

In predict, have arguments / options for:

  • samples_path (as is)
  • model_path (as is)
  • output_file: file path to save the predictions
  • output_directory: if this is specified, output_file will be relative to output_directory. Default to the current directory
  • keep_features: if true, we'll copy the train and test features to the output_directory
  • overwrite: just check whether the prediction path exists, and ask about overwriting predictions. Don't check about overwriting train / test features

In evaluate, add an overwrite flag as well
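A sketch of what the predict signature could look like, assuming a typer-based CLI (option names follow the list above; defaults and prompting behavior are illustrative):

    from pathlib import Path
    from typing import Optional

    import typer

    app = typer.Typer()


    @app.command()
    def predict(
        samples_path: Path = typer.Argument(..., help="Path to a CSV of sample points"),
        model_path: Optional[Path] = typer.Option(None, help="Path to a trained model zip"),
        output_file: Path = typer.Option(Path("preds.csv"), help="File path to save the predictions"),
        output_directory: Path = typer.Option(Path("."), help="output_file is saved relative to this"),
        keep_features: bool = typer.Option(False, help="Also copy train/test features to output_directory"),
        overwrite: bool = typer.Option(False, help="Overwrite an existing predictions file without prompting"),
    ):
        preds_path = output_directory / output_file
        # only prompt about the predictions file, not train/test features
        if preds_path.exists() and not overwrite:
            typer.confirm(f"{preds_path} exists. Overwrite?", abort=True)
        ...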

Test adding climate data into the model

  • see model performance changes
  • Short notebook assessing different performance on samples with satellite imagery vs without
  • Decide whether we want to generate predictions for samples that have no satellite imagery features, but do have climate features

Do not predict where there is no data in bounding box

For some point/item combos, we have a satellite tile but the bounding box contains entirely no-data pixels. We should:

  • drop these rows in training
  • drop these rows in prediction
  • have the prediction be nan for the point if there is no data in the bounding box for any item

This is in line with not using/predicting samples for which there is no imagery.

We can identify these as rows in the satellite data where all satellite band values are 0.
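A sketch of that check on the feature table (band column names are illustrative):

    import pandas as pd

    satellite_features = pd.DataFrame(
        {
            "sample_id": ["a", "b", "c"],
            "B02_mean": [0.0, 412.5, 0.0],
            "B03_mean": [0.0, 390.1, 0.0],
            "B04_mean": [0.0, 367.8, 12.3],
        }
    )

    band_cols = [c for c in satellite_features.columns if c != "sample_id"]
    no_data_mask = (satellite_features[band_cols] == 0).all(axis=1)

    # drop these rows for training/prediction; samples with no remaining rows get NaN predictions
    satellite_features = satellite_features[~no_data_mask]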

Move `target_col` to model configuration

  • Move target column to ModelTrainingConfig and also save it out with pipeline._to_disk, and use that to determine the target columns. Rename to ModelConfig, and make it a required input to CyanoModelPipeline
  • When pipeline is instantiated, set pipeline.target_col = model_config.target_col (in addition to setting pipeline.model_config)
  • Update experiment.py to remove target_col (now part of model config)
  • By default, output severity and exact density in predictions rather than log density in pipeline._predict_model
  • Adjust eval code to change density_metrics to log_density_metrics, and if we only have exact density, convert to log density before generating. Still output plots, possibly using exact density but with a log scale
  • Write new tests for the updated code
  • Update the best model in the git repo for the new model.zip structure (features_config.yaml and model_training_config.yaml)

Specify CRS for sample points

I tried passing in a point from google maps for the SF Bay and got an error:

 cyano predict-point -lat -122.3753 -lon 37.763987
...
ValueError: Latitude must be in the [-90; 90] range.

Which reminded me that we only support one CRS and don't specify what that is.

Minimum:

  • add to help text which CRS is currently supported

Nice to have:

  • allow CRS to be set at the command line (at least for predict-point); see the sketch below
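A sketch of the conversion behind a possible --crs option using pyproj (the option itself is a proposal, not an existing flag; the example coordinates are approximate web-mercator values for the SF Bay):

    from pyproj import Transformer


    def to_wgs84(x: float, y: float, crs: str) -> tuple[float, float]:
        # convert coordinates from the given CRS to (latitude, longitude) in EPSG:4326
        transformer = Transformer.from_crs(crs, "EPSG:4326", always_xy=True)
        lon, lat = transformer.transform(x, y)
        return lat, lon


    print(to_wgs84(-13_623_800, 4_546_000, "EPSG:3857"))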

Test adding elevation data into the model

  • see model performance changes
  • Short notebook assessing different performance on samples with satellite imagery vs without
  • Decide whether we want to generate predictions for samples that have no satellite imagery features, but do have elevation features

Investigate and better handle NoDataInBounds

Some sample / item combinations raise a rioxarray.exceptions.NoDataInBounds error. In some cases, this happens even when the bounding box used is within the bounds of the imagery

  • Determine why this error occurs. Are we calculating bounding boxes correctly?
  • Implement better handling of this error in satellite_data.py/download_satellite_data

related to #22

Try and implement MLFlow for experiment tracking

based on #9

If we don't use MLFlow, just create a google doc or google slides to document the conclusion / decision from each experiment. E.g. what are we freezing moving forward based on the experiment results?

Low-lift QA checks for temporal consistency

One of the things we know NOAA will do is some spot checks, e.g. looking at imagery and seeing if predictions align and looking at temporal consistency.

As a first pass, let's look at temporal consistency.

  • for some points in the test set, add additional rows for nearby dates (e.g. a 4 week window); see the sketch at the end of this issue
  • generate predictions
  • plot trends over time for predictions to see how similar they are
  • bonus: look at imagery for these predictions

needs #54
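A sketch of building the expanded sample file for this check (file name, column names, and the window are illustrative):

    import pandas as pd

    test_points = pd.read_csv("test_points.csv")  # hypothetical: sample_id, latitude, longitude, date

    expanded = []
    for row in test_points.itertuples():
        for offset in range(-14, 15, 7):  # +/- 2 weeks at weekly steps
            expanded.append(
                {
                    "latitude": row.latitude,
                    "longitude": row.longitude,
                    "date": pd.Timestamp(row.date) + pd.Timedelta(days=offset),
                    "source_sample": row.sample_id,
                }
            )
    pd.DataFrame(expanded).to_csv("temporal_qa_samples.csv", index=False)

    # then generate predictions for temporal_qa_samples.csv and plot density over time per source_sample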

Train and commit final model

  • train final model (if not already trained)
  • add to assets and commit to repo
  • make CLI predict take a default model zip

Try predicting density directly rather than severity buckets

The winning models are very good at comparatively sorting points based on density, but don't always assign exactly the right bucket.


What happens if we use all of the same features but train a model to predict either exact density (cells/mL) or log of exact density?
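A minimal sketch of the log-density variant (the feature matrix and densities are placeholders; the idea is just to train on log1p(density) and convert back for reporting):

    import numpy as np
    import lightgbm as lgb

    X_train = np.random.rand(200, 10)  # placeholder satellite features
    density = np.random.lognormal(mean=10, sigma=2, size=200)  # placeholder cells/mL

    model = lgb.LGBMRegressor(n_estimators=100)
    model.fit(X_train, np.log1p(density))  # train on log density

    pred_log_density = model.predict(X_train[:5])
    pred_density = np.expm1(pred_log_density)  # convert back to cells/mL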
