drivendataorg / cyfi

Estimate cyanobacteria density based on Sentinel-2 satellite imagery

Home Page: https://cyfi.drivendata.org/

License: MIT License

Languages: Makefile 3.31%, Python 96.69%
Topics: cyanobacteria, habs, satellite-imagery, sentinel-2

cyfi's People

Contributors

ejm714, jayqi, klwetstone


cyfi's Issues

Cut package release

  • Make sure README is fully updated
  • Update HISTORY.md
  • Make sure docs are fully set up
  • Either update or remove the examples folder
  • Make sure package name is updated everywhere
  • Add GitHub workflows to release to PyPI
  • Add back in macos-latest to the os list in tests.yml. See this slack thread

Placeholder issue for user-centered outputs

As part of this work, we want to have some deliverables [pages / slides / blog post / demo] that show how this can be used.

Starting this issue to capture ideas:

  • rank regions of concern
  • overlay predictions on a map to visualize
  • testimonial from a state (e.g. California)
  • metrics on presence / absence

Format could be things like:

  • slide deck
  • blog post
  • page in docs

Could be set up either as narrative storytelling or a demo video. Could potentially live on the beta page (through NASA).

Experiment with making PC search more efficient

See if we can make PC search more efficient

  • test using smaller values for pc_meters_search_window. The current number for pc_meters_search_window is taken directly from the third place solution
  • test using intersects instead of bounding box in our search

From experimentation during competition prep, we got more results that include the given point when we use a large bounding box (which is a bit odd).
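A rough sketch of what the two search styles could look like against the Planetary Computer STAC API via pystac-client; the 50 km window, collection name, point, and date range here are placeholders, not the package's actual query code:

    import planetary_computer
    from pystac_client import Client

    catalog = Client.open(
        "https://planetarycomputer.microsoft.com/api/stac/v1",
        modifier=planetary_computer.sign_inplace,
    )

    lat, lon = 41.42, -73.21  # hypothetical sample point
    date_range = "2021-07-01/2021-08-15"

    # Current style: bounding box around the point (window converted roughly to degrees)
    meters = 50_000
    deg = meters / 111_320  # approx. meters per degree of latitude
    bbox_search = catalog.search(
        collections=["sentinel-2-l2a"],
        bbox=[lon - deg, lat - deg, lon + deg, lat + deg],
        datetime=date_range,
    )

    # Alternative to test: only items whose footprint actually contains the point
    point_search = catalog.search(
        collections=["sentinel-2-l2a"],
        intersects={"type": "Point", "coordinates": [lon, lat]},
        datetime=date_range,
    )

    print(len(bbox_search.item_collection()), len(point_search.item_collection()))

Comparing the two result sets for a handful of points would also help explain why the large bounding box returns more items that include the point.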

Determine how to handle missing satellite imagery

Decide how we want to handle cases where:

  • No imagery is found for a sample. Currently we fill imagery features with the mean over all rows, but we probably want to do something more sophisticated
  • An item is selected for a sample in the metadata, but no imagery is downloaded for the item/sample
  • An item is selected for a sample in the metadata, and only some of the required bands are successfully downloaded. Currently, none of the item's imagery is used in this case

For samples with no imagery, third place predicts the average of all predicted severities in a sample's given region. We can't do this consistently in a production context.

Note that LightGBM is able to handle missing values as input.
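As a point of reference, a minimal illustration (not the package's training code) that LightGBM accepts NaN feature values directly, so missing-imagery rows could be left as NaN rather than mean-filled:

    import numpy as np
    import lightgbm as lgb

    X = np.array([[0.1, 0.5], [np.nan, 0.4], [0.3, np.nan], [0.2, 0.6]] * 10)
    y = np.array([1.0, 2.0, 1.5, 3.0] * 10)

    # use_missing is on by default; shown here for emphasis
    model = lgb.LGBMRegressor(n_estimators=10, use_missing=True)
    model.fit(X, y)
    print(model.predict(np.array([[np.nan, 0.5]])))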

Set severity levels based on WHO guidance

Rather than using the severity levels from Tick Tick Bloom, let's follow WHO guidance.


Very-high has proven to be quite rare, so low / med / high should be sufficient. Users can always bin differently given that we output density.
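A minimal sketch of what the binning could look like, assuming cutoffs of 20,000 and 100,000 cells/mL in the spirit of WHO recreational water guidance (the exact thresholds and labels would need to be confirmed against the guidance itself):

    import pandas as pd

    density = pd.Series([5_000, 45_000, 250_000, 1_200_000], name="density_cells_per_ml")
    severity = pd.cut(
        density,
        bins=[0, 20_000, 100_000, float("inf")],  # assumed WHO-style cutoffs
        labels=["low", "moderate", "high"],
    )
    print(pd.concat([density, severity.rename("severity")], axis=1))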

Remove functionality to use past PC search in package

  • Before releasing as a package, remove the default behavior in this PR that uses past planetary computer search results for all competition data.
  • Remove tests as needed, e.g. test_generate_candidate_metadata

In the final package, we will assume that planetary computer search results will always be regenerated

Only evaluate experiments on data with features

  • Only predict on samples with at least one non-metadata feature
  • Only evaluate on samples with at least one non-metadata feature
  • In preds.csv, include all samples and have nans for samples with no features
  • In results.json, also include the number and percent of samples that we did not predict on

Consolidate test assets

Right now we have test assets for evaluate_data.csv, train_data.csv, and predict_data.csv. See if we can consolidate into fewer. E.g. we may be able to have a longer file (~20 rows) and then divvy it up into train/test files as needed

Mock calls to APIs in tests

Rather than hitting planetary computer APIs in the tests, mock the results from the API so our tests do not fail as a result of something API-related (e.g. too many hits, internet issues, etc.)

Consider having one test that does hit the API, so that only that test fails if there is an API-related problem
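A sketch of what the mocking could look like with unittest.mock; search_planetary_computer is a hypothetical stand-in for whichever function in cyano/data/satellite_data.py actually issues the STAC query:

    from unittest.mock import patch

    import pandas as pd

    FAKE_RESULTS = pd.DataFrame(
        {"item_id": ["S2A_fake_item"], "cloud_cover": [3.2], "days_before_sample": [5]}
    )


    def test_metadata_generation_is_offline():
        with patch(
            "cyano.data.satellite_data.search_planetary_computer",
            return_value=FAKE_RESULTS,
        ) as mock_search:
            # call the code under test here; it should consume FAKE_RESULTS
            # instead of hitting the network, and assertions go against mock_search
            ...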

Use past PC results data if available

  • Use past PC search results data if available

  • Modify the satellite querying process in the package to match how we'll need to pull in data from our big PC search for all competition data

Update `cyano/experiment` folder for package

  • Isolate experimentation to a single file
  • Move the CLI command from cli.py into experiment.py
  • Remove past configs and train_test_split.py
  • Add note in README.md that there is an unsupported experiment module for training new models

Finalize folds implementation

Final pieces of #41 (training with folds)

  • Update evaluation code to get feature importances
  • If someone doesn't have region and n_folds > 1, warn when doing model training and train without folds. Change _prep_train_data so that we don't error if region is not provided and n_folds > 1
  • Also warn if someone has n_folds > 1 and fewer samples than n_folds
  • change if / else order in training so that first we check whether we're training with multiple folds

Cache with feature options hash

Within the cache directory, cache to a folder specified by the hash of the features config. Then if we have relevant previously saved imagery, we can use the existing imagery
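A minimal sketch of keying the cache on a hash of the features config (the config fields shown are assumptions, not the package's real schema):

    import hashlib
    import json
    from pathlib import Path

    features_config = {"image_feature_meter_window": 500, "satellite_image_features": ["B02", "B03"]}

    # hash the config so runs with identical feature settings share a cache folder
    config_hash = hashlib.md5(
        json.dumps(features_config, sort_keys=True).encode()
    ).hexdigest()

    cache_dir = Path("~/.cache/cyano").expanduser() / config_hash
    cache_dir.mkdir(parents=True, exist_ok=True)
    print(cache_dir)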

Create gradio app with test set points and predictions on imagery

We want a user-friendly way to peruse the test set predictions on imagery.

Right now we've got a notebook with some examples and it shouldn't be that much more work to put this into a gradio app. We can precompute the image links for each of the test set points so the app just needs to load and render a specified box.


Gradio demo code: https://huggingface.co/spaces/gradio/xgboost-income-prediction-with-explainability/blob/main/app.py
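A minimal sketch of the gradio side, assuming a precomputed CSV with a sample ID, an image path, and a prediction per test point (the file and column names are hypothetical):

    import gradio as gr
    import pandas as pd

    points = pd.read_csv("test_set_points.csv")  # hypothetical: sample_id, image_path, severity


    def show_point(sample_id: str):
        # look up the precomputed chip and prediction for the chosen test point
        row = points.set_index("sample_id").loc[sample_id]
        return row["image_path"], f"Predicted severity: {row['severity']}"


    demo = gr.Interface(
        fn=show_point,
        inputs=gr.Dropdown(choices=points["sample_id"].tolist(), label="Test point"),
        outputs=[gr.Image(label="Sentinel-2 chip"), gr.Textbox(label="Prediction")],
    )

    if __name__ == "__main__":
        demo.launch()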

Retrain best model with severity bins

With our updated broader severity bins, regenerate our best model evaluation files (s3://drivendata-competition-nasa-cyanobacteria/experiments/results/best_model) so the metrics will be comparable to any new models trained

Add tests of full PC search results

Our package draws from results generated by code in the competition repo

  • Add more checks to make sure the results are correct. E.g. this could look like getting the metadata for 100 points and comparing those rows to the relevant ones generated from this code.

Make logging more intuitive for users

Our logging was written mostly with experimentation in mind. Go through what the logs will look like for a user, and:

  • Make sure all log messages are easy to interpret
  • Consider changing what is logged so key pieces of information are easier to pick out
  • Remove progress bars when not relevant

Reproduce one winning submission in our code package

Before we can run any experiments, we need to reproduce one winner's submission. This achieves the following:

  • ensures our refactored code is working as expected
  • confirms our pipes are hooked up properly
  • gives us a baseline against which to compare experiments

I'd suggest starting with the predict pipeline for simplicity but we'll need to have both for experiments.

Add advanced use docs page

We could add an advanced use page that outlines how to:

  • train your own model and specify different configuration options (features config, model config, experiment config)
  • run an experiment with python cyfi/experiment.py --help
  • use a custom model to predict at the command line cyfi predict --model-path...

Look into experiment-tracking tools

It may be worth using a real experiment-tracking tool for this. Let's spend an hour or two looking at options and seeing if they're a good fit.

  • Weights & Biases
  • Neptune
  • MLflow
  • Comet
  • DVC Studio

Do not allow extra fields on config models

Right now extra fields are ignored if passed in. This means if a user has a typo in something like "features_config" in their yaml, they'll get the defaults rather than what they specified and it will happen silently.

Actions:

  • do not allow extra fields (look up the right way to do this in pydantic v2)
  • check in the config tests that we get an error if extra fields are passed in (see the sketch below)
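For reference, in pydantic v2 this is extra="forbid" on model_config; a sketch (the field name is illustrative):

    from pydantic import BaseModel, ConfigDict, ValidationError


    class FeaturesConfig(BaseModel):
        model_config = ConfigDict(extra="forbid")

        image_feature_meter_window: int = 500


    try:
        FeaturesConfig(image_feature_meter_widnow=200)  # typo'd field name
    except ValidationError as e:
        print(e)  # validation error reports that extra inputs are not permitted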

Duplicate sample UIDs will cause an error

If the input csv contains duplicate sample points (location/date combos), we should just predict the same value for each of them (i.e. treat them as independent observations). Currently the code errors.

Pathway to error

This `loc` call means we can end up with a DataFrame with multiple rows rather than a Series here.

https://github.com/drivendataorg/cyanobacteria-prediction/blob/689f1200adef7ceea943b5018152550af607286a/cyano/data/satellite_data.py#L374

And then sample.latitude is not a single value

https://github.com/drivendataorg/cyanobacteria-prediction/blob/689f1200adef7ceea943b5018152550af607286a/cyano/data/satellite_data.py#L380

Which causes this to error

https://github.com/drivendataorg/cyanobacteria-prediction/blob/689f1200adef7ceea943b5018152550af607286a/cyano/data/satellite_data.py#L38

Suggested implementation

This code should just be iterating over rows, rather than loc-ing to get the row.

https://github.com/drivendataorg/cyanobacteria-prediction/blob/689f1200adef7ceea943b5018152550af607286a/cyano/data/satellite_data.py#L437-L449
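A sketch of the direction of the fix: iterate over the rows of the samples frame directly so duplicate location/date combinations are each handled independently (column names and the surrounding logic are illustrative, not the package's code):

    import pandas as pd

    samples = pd.DataFrame(
        {
            "sample_id": ["abc123", "abc123"],  # duplicate location/date combo
            "latitude": [41.42, 41.42],
            "longitude": [-73.21, -73.21],
            "date": ["2021-08-01", "2021-08-01"],
        }
    )

    for sample in samples.itertuples():
        # each row is processed on its own, even if sample_id repeats
        print(sample.sample_id, sample.latitude, sample.longitude)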

Use locally saved imagery if available

The info we need to tell whether we already have the array we need:

  • item ID
  • sample ID (hash)
  • bounding box size in meters

Within the cache dir, we can create a directory with consistent naming that indicates the bounding box size in meters. Within that, the file structure can identify specific item / sample combos.

Then we'll be able to construct a specific path, and check whether it already exists before downloading
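A sketch of what that path construction could look like (directory layout and file names are illustrative):

    from pathlib import Path


    def imagery_cache_path(cache_dir: Path, meter_window: int, item_id: str, sample_id: str) -> Path:
        # one folder per bounding box size, then one array per item/sample combo
        return cache_dir / f"bbox_{meter_window}m" / item_id / f"{sample_id}.npy"


    path = imagery_cache_path(
        Path("~/.cache/cyano").expanduser(), 500, "S2A_MSIL2A_20210801_fake", "abc123"
    )
    if path.exists():
        print(f"Reusing cached imagery at {path}")
    else:
        print(f"Not cached yet; download and save to {path}")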

Update CLI options

Change CLI options to be more user friendly

In predict, have arguments / options for:

  • samples_path (as is)
  • model_path (as is)
  • output_file: file path to save the predictions
  • output_directory: if this is specified, output_file will be relative to output_directory. Default to the current directory
  • keep_features: if true, we'll copy the train and test features to the output_directory
  • overwrite: just check whether the prediction path exists, and ask about overwriting predictions. Don't check about overwriting train / test features

In evaluate, add an overwrite flag as well
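A sketch of what the predict signature could look like, assuming a typer-based CLI (option names follow the list above; defaults and prompting behavior are illustrative):

    from pathlib import Path
    from typing import Optional

    import typer

    app = typer.Typer()


    @app.command()
    def predict(
        samples_path: Path = typer.Argument(..., help="Path to a CSV of sample points"),
        model_path: Optional[Path] = typer.Option(None, help="Path to a trained model zip"),
        output_file: Path = typer.Option(Path("preds.csv"), help="File path to save the predictions"),
        output_directory: Path = typer.Option(Path("."), help="output_file is saved relative to this"),
        keep_features: bool = typer.Option(False, help="Also copy train/test features to output_directory"),
        overwrite: bool = typer.Option(False, help="Overwrite an existing predictions file without prompting"),
    ):
        preds_path = output_directory / output_file
        # only prompt about the predictions file, not train/test features
        if preds_path.exists() and not overwrite:
            typer.confirm(f"{preds_path} exists. Overwrite?", abort=True)
        ...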

Test adding climate data into the model

  • see model performance changes
  • Short notebook assessing different performance on samples with satellite imagery vs without
  • Decide whether we want to generate predictions for samples that have no satellite imagery features, but do have climate features

Do not predict where there is no data in bounding box

For some point/item combos, we have a satellite tile but the bounding box contains entirely no-data pixels. We should:

  • drop these rows in training
  • drop these rows in prediction
  • have the prediction be nan for the point if there is no data in the bounding box for any item

This is in line with not using/predicting samples for which there is no imagery.

We can identify these as rows in the satellite data where all satellite band values are 0.
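A sketch of that check on the feature table (band column names are illustrative):

    import pandas as pd

    satellite_features = pd.DataFrame(
        {
            "sample_id": ["a", "b", "c"],
            "B02_mean": [0.0, 412.5, 0.0],
            "B03_mean": [0.0, 390.1, 0.0],
            "B04_mean": [0.0, 367.8, 12.3],
        }
    )

    band_cols = [c for c in satellite_features.columns if c != "sample_id"]
    no_data_mask = (satellite_features[band_cols] == 0).all(axis=1)

    # drop these rows for training/prediction; samples with no remaining rows get NaN predictions
    satellite_features = satellite_features[~no_data_mask]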

Move `target_col` to model configuration

  • Move target column to ModelTrainingConfig and also save it out with pipeline._to_disk, and use that to determine the target columns. Rename to ModelConfig, and make it a required input to CyanoModelPipeline
  • When pipeline is instantiated, set pipeline.target_col = model_config.target_col (in addition to setting pipeline.model_config)
  • Update experiment.py to remove target_col (now part of model config)
  • By default, output severity and exact density in predictions rather than log density in pipeline._predict_model
  • Adjust eval code to change density_metrics to log_density_metrics, and if we only have exact density, convert to log density before generating. Still output plots, possibly using exact density but with a log scale
  • Write new tests for the updated code
  • Update the best model in the git repo for the new model.zip structure (features_config.yaml and model_training_config.yaml)

Specify CRS for sample points

I tried passing in a point from google maps for the SF Bay and got an error:

 cyano predict-point -lat -122.3753 -lon 37.763987
...
ValueError: Latitude must be in the [-90; 90] range.

Which reminded me that we only support one CRS and don't specify what that is.

Minimum:

  • add to help text which CRS is currently supported

Nice to have:

  • allow CRS to be set at the command line (at least for predict-point); see the sketch below
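A sketch of the conversion behind a possible --crs option using pyproj (the option itself is a proposal, not an existing flag; the example coordinates are approximate web-mercator values for the SF Bay):

    from pyproj import Transformer


    def to_wgs84(x: float, y: float, crs: str) -> tuple[float, float]:
        # convert coordinates from the given CRS to (latitude, longitude) in EPSG:4326
        transformer = Transformer.from_crs(crs, "EPSG:4326", always_xy=True)
        lon, lat = transformer.transform(x, y)
        return lat, lon


    print(to_wgs84(-13_623_800, 4_546_000, "EPSG:3857"))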

Test adding elevation data into the model

  • see model performance changes
  • Short notebook assessing different performance on samples with satellite imagery vs without
  • Decide whether we want to generate predictions for samples that have no satellite imagery features, but do have elevation features

Investigate and better handle NoDataInBounds

Some sample / item combinations raise a rioxarray.exceptions.NoDataInBounds error. In some cases, this happens even when the bounding box used is within the bounds of the imagery

  • Determine why this error occurs. Are we calculating bounding boxes correctly?
  • Implement better handling of this error in satellite_data.py/download_satellite_data

related to #22

Try and implement MLFlow for experiment tracking

based on #9

If we don't use MLFlow, just create a google doc or google slides to document the conclusion / decision from each experiment. E.g. what are we freezing moving forward based on the experiment results?

Low-lift QA checks for temporal consistency

One of the things we know NOAA will do is some spot checks, e.g. looking at imagery and seeing if predictions align and looking at temporal consistency.

As a first pass, let's look at temporal consistency.

  • for some points in the test set, add additional rows for nearby dates (e.g. a 4 week window); see the sketch at the end of this issue
  • generate predictions
  • plot trends over time for predictions to see how similar they are
  • bonus: look at imagery for these predictions

needs #54
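A sketch of building the expanded sample file for this check (file name, column names, and the window are illustrative):

    import pandas as pd

    test_points = pd.read_csv("test_points.csv")  # hypothetical: sample_id, latitude, longitude, date

    expanded = []
    for row in test_points.itertuples():
        for offset in range(-14, 15, 7):  # +/- 2 weeks at weekly steps
            expanded.append(
                {
                    "latitude": row.latitude,
                    "longitude": row.longitude,
                    "date": pd.Timestamp(row.date) + pd.Timedelta(days=offset),
                    "source_sample": row.sample_id,
                }
            )
    pd.DataFrame(expanded).to_csv("temporal_qa_samples.csv", index=False)

    # then generate predictions for temporal_qa_samples.csv and plot density over time per source_sample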

Train and commit final model

  • train final model (if not already trained)
  • add to assets and commit to repo
  • make CLI predict take a default model zip

Try predicting density directly rather than severity buckets

The winning models are very good at comparatively sorting points based on density, but don't always assign exactly the right bucket.


What happens if we use all of the same features but train a model to predict either exact density (cells/mL) or log of exact density?
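A minimal sketch of the log-density variant (the feature matrix and densities are placeholders; the idea is just to train on log1p(density) and convert back for reporting):

    import numpy as np
    import lightgbm as lgb

    X_train = np.random.rand(200, 10)  # placeholder satellite features
    density = np.random.lognormal(mean=10, sigma=2, size=200)  # placeholder cells/mL

    model = lgb.LGBMRegressor(n_estimators=100)
    model.fit(X_train, np.log1p(density))  # train on log density

    pred_log_density = model.predict(X_train[:5])
    pred_density = np.expm1(pred_log_density)  # convert back to cells/mL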
