icenet-paper's Introduction

IceNet: Seasonal Arctic sea ice forecasting with probabilistic deep learning

This codebase accompanies the Nature Communications paper Seasonal Arctic sea ice forecasting with probabilistic deep learning. It includes code to fully reproduce all the results of the study from scratch. It also includes code to download the data generated by the study, published on the Polar Data Centre, and reproduce all the paper's figures.

The flexibility of the code simplifies possible extensions of the study. The data processing pipeline and custom IceNetDataLoader class let you dictate which variables are input to the networks, which climate simulations are used for pre-training, and how far ahead to forecast. The architecture of the IceNet model can be adapted in icenet/models.py. The output variable to forecast could even be changed by refactoring the IceNetDataLoader class.

A demonstrator of this codebase (downloading pre-trained IceNet networks, then generating and analysing forecasts) produced by @acocac can be found in The Environmental Data Science Book.

The guidelines below assume you're working in the command line of a Unix-like machine with a GPU. If aiming to reproduce all the results of the study, 1 TB of space should safely cover the storage requirements from the data downloaded and generated.

If you run into issues or have suggestions for improvement, please raise an issue or email me ([email protected]).

Steps to plot paper figures using the paper's results & forecasts

To reproduce the paper figures directly from the paper's results and forecasts, run the following after setting up the conda environment (see Step 1 below):

  • ./download_paper_generated_data.sh. Downloads raw data from the paper. From here, you could start to explore the results of the paper in more detail.
  • python3 icenet/download_sic_data.py. This is needed to plot the ground truth ice edge. Note this download can take anywhere from 1 to 12 hours to complete.
  • python3 icenet/gen_masks.py
  • python3 icenet/plot_paper_figures.py. Figures are saved in figures/paper_figures/.

Steps to reproduce the paper's results from scratch

0) Preliminary setup

  • I use conda for package management. If you don't yet have conda, you can download it here.

  • To be able to download ERA5 data, you must first set up a CDS account and populate your .cdsapirc file. Follow the 'Install the CDS API key' instructions here.

  • To download the ECMWF SEAS5 forecast data for comparing with IceNet, you must first register with ECMWF here. If you are from an ECMWF Member State, you can then gain access to the ECMWF MARS Catalogue by contacting your Computing Representative. Once registered, obtain your API key here and fill the ECMWF API entries in icenet/config.py.

  • To track training runs and perform Bayesian hyperparameter tuning with Weights and Biases, sign up at https://wandb.ai/site. Obtain your API key from here and fill the Weights and Biases entries in icenet/config.py. Ensure you are logged in by running wandb login after setting up the conda environment.

1) Set up conda environment

After cloning the repo, run the commands below in the root of the repository to set up the conda environment:

  • If you don't have mamba already, install it to your base env for faster conda operations: conda install -n base mamba -c conda-forge.
  • For upgradeability use the versioned direct dependency environment file: mamba env create --file environment.yml
  • For reproducibility use the locked environment file: mamba env create --file environment.locked.yml
  • Activate the environment before running code: conda activate icenet

2) Download data

The CMIP6 variable naming convention is used throughout this project - e.g. tas for surface air temperature, siconca for sea ice concentration, etc.

Warning: some downloads are slow and the net download time can take 1-2 days. It may be advisable to write a bash script to automatically execute all these commands in sequence and run it over a weekend.

  • python3 icenet/gen_masks.py. This obtains masks for land, the polar holes, monthly maximum ice extent (the 'active grid cell region'), and the Arctic regions & coastline.

  • python3 icenet/download_sic_data.py. Downloads OSI-SAF SIC data. This computes monthly-averaged SIC server-side, downloads the results, and bilinearly interpolates missing grid cells (e.g. polar hole). Note this download can take anywhere from 1 to 12 hours to complete.

  • ./download_era5_data_in_parallel.sh. Downloads ERA5 reanalysis data. This runs multiple parallel python3 icenet/download_era5_data.py commands in the background to acquire each ERA5 variable. The raw ERA5 data is downloaded in global latitude-longitude format and regridded to the EASE grid that OSI-SAF SIC data lies on. Logs are output to logs/era5_download_logs/.

  • ./download_cmip6_data_in_parallel.sh. Downloads CMIP6 climate simulation data. This runs multiple parallel python3 icenet/download_cmip6_data.py commands in the background to download each climate simulation. The raw CMIP6 data is regridded from global latitude-longitude format to the EASE grid that OSI-SAF SIC data lies on. Logs are output to logs/cmip6_download_logs/.

  • ./rotate_wind_data_in_parallel.sh. This runs multiple parallel python3 icenet/rotate_wind_data.py commands in the background to rotate the ERA5 and CMIP6 wind vector data onto the EASE grid. Logs are output to logs/wind_rotation_logs/.

  • ./download_seas5_forecasts_in_parallel.sh. Downloads ECMWF SEAS5 SIC forecasts. This runs multiple parallel python3 icenet/download_seas5_forecasts.py commands to acquire 2002-2020 SEAS5 forecasts for each lead time via the ECMWF MARS API and regrid the forecasts to EASE. The forecasts are saved to data/forecasts/seas5/ in the folders latlon/ and EASE/. Logs are output to logs/seas5_download_logs/.

  • python3 icenet/biascorrect_seas5_forecasts.py. Bias corrects the SEAS5 2012+ forecasts using 2002-2011 forecasts. Also computes SEAS5 sea ice probability (SIP) fields. The bias-corrected forecasts are saved as NetCDFs in data/forecasts/seas5/ with dimensions (target date, y, x, lead time).
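
The warning at the top of this step suggests scripting the downloads so they run unattended. A minimal sketch of that idea in Python, assuming the commands are run from the repository root in the order listed above; the names DOWNLOAD_STEPS and run_steps are hypothetical and not part of the repository:

```python
import subprocess

# Download steps from section 2, in the order they are listed.
DOWNLOAD_STEPS = [
    ["python3", "icenet/gen_masks.py"],
    ["python3", "icenet/download_sic_data.py"],
    ["./download_era5_data_in_parallel.sh"],
    ["./download_cmip6_data_in_parallel.sh"],
    ["./rotate_wind_data_in_parallel.sh"],
    ["./download_seas5_forecasts_in_parallel.sh"],
    ["python3", "icenet/biascorrect_seas5_forecasts.py"],
]


def run_steps(steps, runner=subprocess.run):
    """Run each command in order, stopping at the first failure."""
    completed = []
    for cmd in steps:
        print("Running:", " ".join(cmd))
        # check=True raises CalledProcessError on a non-zero exit code.
        runner(cmd, check=True)
        completed.append(cmd)
    return completed


# To start the whole sequence from the repository root:
# run_steps(DOWNLOAD_STEPS)
```

Because the *_in_parallel.sh scripts launch their work in the background, they may return before the downloads have finished. Check the corresponding logs/ folder before relying on a later step such as the wind rotation or the bias correction.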

3) Process data

3.1) Set up IceNet's custom data loader

  • python3 icenet/gen_data_loader_config.py. Sets up the data loader configuration. This is saved as a JSON file dictating IceNet's input and output data, train/val/test splits, etc. The config file is used to instantiate the custom IceNetDataLoader class. Two example config files are provided in this repository in dataloader_configs/. Each config file is identified by a dataloader ID, determined by a timestamp and a user-provided name (e.g. 2021_06_15_1854_icenet_nature_communications). The dataloader ID, together with an architecture ID set in the training script, provides an 'IceNet ID' which uniquely identifies an IceNet ensemble model by its data configuration and architecture.
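
As an illustration of the naming scheme, the sketch below builds an ID of the same shape as the example above (timestamp followed by a user-provided name). The function make_dataloader_id is hypothetical, and the timestamp format is inferred from the example ID rather than taken from the repository code:

```python
from datetime import datetime


def make_dataloader_id(name, when=None):
    """Build an ID of the form YYYY_MM_DD_HHMM_<name>."""
    when = when or datetime.now()
    return when.strftime("%Y_%m_%d_%H%M") + "_" + name


example = make_dataloader_id(
    "icenet_nature_communications", datetime(2021, 6, 15, 18, 54)
)
print(example)  # prints 2021_06_15_1854_icenet_nature_communications
```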

3.2) Preprocess the raw data

  • python3 icenet/preproc_icenet_data.py. Normalises the raw NetCDF data and saves it as monthly NumPy files. The normalisation parameters (mean/std dev or min/max) are saved as a JSON file so that new data can be preprocessed without having to recompute the normalisation. A custom IceNetDataPreProcessor class

  • The observational training & validation dataset for IceNet is just 23 GB, which can fit in RAM on some systems and significantly speed up the fine-tuning training phase compared with using the data loader. To benefit from this, run python3 icenet/gen_numpy_obs_train_val_datasets.py to generate NumPy tensors for the train/val input/output data. To further benefit from the training speed improvements of tf.data, generate a TFRecords dataset from the NumPy tensors using python3 icenet/gen_tfrecords_obs_train_val_datasets.py. Whether to use the data loader, NumPy arrays, or TFRecords datasets for training is controlled by bools in icenet/train_icenet.py.
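
The idea of saving normalisation parameters so that new data can be preprocessed without recomputing them can be sketched as follows. This is a toy version using plain Python lists and mean/std normalisation only; the function names and the tas values are made up for illustration, and the repository itself works on NetCDF and NumPy data:

```python
import json
import statistics


def fit_norm_params(values):
    """Compute the mean and standard deviation of a training variable."""
    return {"mean": statistics.fmean(values), "std": statistics.pstdev(values)}


def normalise(values, params):
    """Apply previously computed parameters to (possibly new) data."""
    return [(v - params["mean"]) / params["std"] for v in values]


train = [270.0, 272.0, 274.0, 276.0]  # made-up tas values in kelvin
params = fit_norm_params(train)

# Round-trip through JSON text, standing in for writing and reading a file.
saved = json.dumps({"tas": params})
loaded = json.loads(saved)

# New data is normalised with the stored parameters, not its own statistics.
new_data = [273.0, 275.0]
normalised_new = normalise(new_data, loaded["tas"])
print(normalised_new)
```

Reusing the stored parameters keeps new inputs on the same scale as the data the networks were trained on.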

4) Train IceNet

4.1) OPTIONAL: Run the hyperparameter search (skip if using default values from paper)

  • Set icenet/train_icenet.py up for hyperparameter tuning: Set pre-training and temperature scaling bools to False in the user input section.
  • wandb sweep icenet/sweep.yaml
  • Then run the wandb agent command that is printed.
  • Cancel the sweep once the wandb.ai page gives a sufficiently clear picture of the optimal hyperparameters.

4.2) Run training

  • Train IceNet networks with python3 icenet/train_icenet.py. This takes hyperparameter settings and the random seed for network weight initialisation as command line inputs. Run this multiple times with different settings of --seed to train an ensemble. Trained networks are saved in trained_networks/<dataloader_ID>/<architecture_ID>/networks/. If working on a shared machine and familiar with SLURM, you may want to wrap this command in a SLURM script.
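
One way to prepare the repeated training runs is to generate one command per seed. A small sketch, assuming only the --seed flag mentioned above; the helper training_commands is hypothetical, and the seed values 36 and 37 are illustrative (they are assumed to correspond to the numbers in the network file names shown in the project structure):

```python
import shlex


def training_commands(seeds):
    """Return one train_icenet.py command (as an argument list) per seed."""
    return [
        ["python3", "icenet/train_icenet.py", "--seed", str(seed)]
        for seed in seeds
    ]


commands = training_commands([36, 37])
for cmd in commands:
    # shlex.join gives a copy-pasteable shell line, e.g. for a SLURM script.
    print(shlex.join(cmd))
```

Each argument list can be passed to subprocess.run, or the printed lines can be pasted into a job script.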

5) Produce forecasts

  • python3 icenet/predict_heldout_data.py. Uses xarray to save predictions over the validation and test years (2012-2020) as NetCDFs for IceNet and the linear trend benchmark. IceNet's forecasts are saved in data/forecasts/icenet/<dataloader_ID>/<architecture_ID>/. For IceNet, the full forecast dataset has dimensions (target date, y, x, lead time, ice class, seed), where seed specifies a single ensemble member or the ensemble-mean forecast. An ensemble-mean SIP forecast is also computed and saved as a separate, smaller file (which only has the first four dimensions).

  • Compute IceNet's ensemble-mean temperature scaling parameter for each lead time: python3 icenet/compute_ensemble_mean_temp_scaling.py. The new, ensemble-mean temperature-scaled SIP forecasts are saved to data/forecasts/icenet/<dataloader_ID>/<architecture_ID>/icenet_sip_forecasts_tempscaled.nc. These forecasts represent the final ensemble-mean IceNet model used for the paper.
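
For readers unfamiliar with temperature scaling, the sketch below shows its generic form: class logits are divided by a temperature T before the softmax, which softens (T > 1) or sharpens (T < 1) the probabilities without changing their ranking. It only applies a given T to made-up logits for a single grid cell; how the repository fits T for each lead time is not shown here, and the three-class example is an assumption:

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up logits for three hypothetical ice classes
unscaled = softmax(logits, temperature=1.0)
softened = softmax(logits, temperature=2.0)
print(unscaled)
print(softened)  # flatter distribution, same class ordering
```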

6) Analyse forecasts

  • python3 icenet/analyse_heldout_predictions.py. Loads the NetCDF forecast data and computes forecast metrics, storing results in a global pandas DataFrame with MultiIndex (model, ensemble member, lead time, target date) and columns for each metric (binary accuracy and sea ice extent error). Uses dask to avoid loading the entire forecast datasets into memory, processing chunks in parallel to significantly speed up the analysis. Results are saved as CSV files in results/forecast_results/ with a timestamp to avoid overwriting. Optionally pre-load the latest CSV file to append new models or metrics to the results without needing to re-analyse existing models. Use this feature to append forecast results from other IceNet models (identified by their dataloader ID and architecture ID) to track the effect of design changes on forecast performance.

  • python3 icenet/analyse_uncertainty.py. Assesses the calibration of IceNet and SEAS5's SIP forecasts. Also determines IceNet's ice edge region and assesses its ice edge bounding ability. Results are saved in results/uncertainty_results/.
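
As a rough illustration of the binary accuracy metric, the sketch below compares a thresholded SIP forecast with a thresholded observed SIC map over the active grid cell region. The thresholds (SIP > 0.5, SIC > 0.15), the flat-list data layout, and the function name are assumptions made for this example, not the repository's implementation:

```python
def binary_accuracy(sip_forecast, sic_observed, active_mask,
                    sip_threshold=0.5, sic_threshold=0.15):
    """Fraction of active grid cells where forecast and observed ice/no-ice agree."""
    correct = 0
    total = 0
    for sip, sic, active in zip(sip_forecast, sic_observed, active_mask):
        if not active:
            continue  # cells outside the active region are ignored
        total += 1
        if (sip > sip_threshold) == (sic > sic_threshold):
            correct += 1
    return correct / total if total else float("nan")


# Four made-up grid cells; the last one lies outside the active region.
sip = [0.9, 0.2, 0.7, 0.4]
sic = [0.8, 0.0, 0.1, 0.5]
mask = [True, True, True, False]
score = binary_accuracy(sip, sic, mask)
print(score)
```

Here two of the three active cells agree, so the score is 2/3.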

7) Run the permute-and-predict method to explore IceNet's most important input variables

  • python3 icenet/permute_and_predict.py. Results are stored in results/permute_and_predict_results/.
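
The general permute-and-predict idea is to shuffle one input variable across samples, re-run the model, and record how much the accuracy drops; a large drop suggests the model relies on that variable. A self-contained toy sketch follows. The model, data, and function names are made up, and the repository's script may differ in what it permutes and which metric it uses:

```python
import random


def accuracy(model, inputs, targets):
    """Fraction of samples where the model output equals the target."""
    hits = sum(model(x) == y for x, y in zip(inputs, targets))
    return hits / len(targets)


def permutation_importance(model, inputs, targets, var_index, seed=0):
    """Accuracy drop after shuffling one input variable across samples."""
    rng = random.Random(seed)
    column = [x[var_index] for x in inputs]
    rng.shuffle(column)
    permuted = [
        x[:var_index] + [value] + x[var_index + 1:]
        for x, value in zip(inputs, column)
    ]
    return accuracy(model, inputs, targets) - accuracy(model, permuted, targets)


def toy_model(x):
    """Toy classifier that only looks at variable 0."""
    return x[0] > 0.5


inputs = [[0.1, 5.0], [0.9, 3.0], [0.2, 7.0], [0.8, 1.0]]
targets = [toy_model(x) for x in inputs]

drop_var0 = permutation_importance(toy_model, inputs, targets, var_index=0)
drop_var1 = permutation_importance(toy_model, inputs, targets, var_index=1)
print(drop_var0, drop_var1)
```

Variable 1 is never used by the toy model, so permuting it leaves the accuracy unchanged.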

8) Generate the paper figures and tables

  • python3 icenet/plot_paper_figures.py. Figures are saved in figures/paper_figures/. Note, you will need the Sea Ice Outlook error CSV file to plot Supp. Fig. 5:
wget -O data/sea_ice_outlook_errors.csv 'https://ramadda.data.bas.ac.uk/repository/entry/get/sea_ice_outlook_errors.csv?entryid=synth%3A71820e7d-c628-4e32-969f-464b7efb187c%3AL3Jlc3VsdHMvb3V0bG9va19lcnJvcnMvc2VhX2ljZV9vdXRsb29rX2Vycm9ycy5jc3Y%3D'

Misc

  • icenet/utils.py defines IceNet utility functions like the data preprocessor, data loader, ERA5 and CMIP6 processing, learning rate decay, and video functionality.
  • icenet/models.py defines network architectures.
  • icenet/config.py defines globals.
  • icenet/losses.py defines loss functions.
  • icenet/callbacks.py defines training callbacks.
  • icenet/metrics.py defines training metrics.

Project structure: simplified output from tree

.
├── data
│   ├── obs
│   ├── cmip6
│   │   ├── EC-Earth3
│   │   │   ├── r10i1p1f1
│   │   │   ├── r12i1p1f1
│   │   │   ├── r14i1p1f1
│   │   │   ├── r2i1p1f1
│   │   │   └── r7i1p1f1
│   │   └── MRI-ESM2-0
│   │       ├── r1i1p1f1
│   │       ├── r2i1p1f1
│   │       ├── r3i1p1f1
│   │       ├── r4i1p1f1
│   │       └── r5i1p1f1
│   ├── forecasts
│   │   ├── icenet
│   │   │   ├── 2021_06_15_1854_icenet_nature_communications
│   │   │   │   └── unet_tempscale
│   │   │   └── 2021_06_30_0954_icenet_pretrain_ablation
│   │   │       └── unet_tempscale
│   │   ├── linear_trend
│   │   └── seas5
│   │       ├── EASE
│   │       └── latlon
│   ├── masks
│   └── network_datasets
│       └── dataset1
│           ├── meta
│           ├── obs
│           ├── transfer
│           └── norm_params.json
├── dataloader_configs
│   ├── 2021_06_15_1854_icenet_nature_communications.json
│   └── 2021_06_30_0954_icenet_pretrain_ablation.json
├── figures
├── icenet
├── logs
│   ├── cmip6_download_logs
│   ├── era5_download_logs
│   ├── seas5_download_logs
│   └── wind_rotation_logs
├── results
│   ├── forecast_results
│   │   └── 2021_07_01_183913_forecast_results.csv
│   ├── permute_and_predict_results
│   │   └── permute_and_predict_results.csv
│   └── uncertainty_results
│       ├── ice_edge_region_results.csv
│       ├── sip_bounding_results.csv
│       └── uncertainty_results.csv
└── trained_networks
    └── 2021_06_15_1854_icenet_nature_communications
        ├── obs_train_val_data
        │   ├── numpy
        │   └── tfrecords
        │       ├── train
        │       └── val
        └── unet_tempscale
            └── networks
                ├── network_tempscaled_36.h5
                ├── network_tempscaled_37.h5
                :

Acknowledgements

Thanks to James Byrne (BAS) and Tony Phillips (BAS) for direct contributions to this codebase.

icenet-paper's People

Contributors

acocac, tom-andersson, ttgmichael

icenet-paper's Issues

Issues with downloading cmip6 MRI-ESM2-0 data

I don't think this is the result of mis-configuration etc at my end - from my understanding this part of the code doesn't depend on any API keys, or any complex python dependencies being properly set up etc.

FWIW I've been trying to run this on Windows, with shell scripts slightly modified to .cmd equivalents... which feels like a bit of a sub-optimal approach, and I'm fairly sure not directly relevant here, although earlier I had another problem that was to do with a weirdly escaped character making its way into some config somewhere. I suppose it's marginally plausible that something related could have somehow snuck in (fairly infinitesimal chance I'd say).

It appears that with --source_id MRI-ESM2-0, download_cmip6_data.py is getting empty results from esgf_search(). The EC-Earth3 downloads completed OK AFAICT... I dug in a little bit more, and find that, for example, a request to

https://esgf-node.llnl.gov/esg-search/search/?source_id=MRI-ESM2-0&member_id=r1i1p1f1&frequency=mon&variable_id=siconca&table_id=SImon&grid_label=gn&experiment_id=ssp245&data_node=esgf-data2.diasjp.net&project=CMIP6&type=File&latest=true&format=application%2Fsolr%2Bjson&offset=0

returns

{
    "responseHeader": {
        "status": 0,
        "QTime": 1343,
        "params": {
            "df": "text",
            "q.alt": "*:*",
            "indent": "true",
            "echoParams": "all",
            "fl": "*,score",
            "start": "0",
            "fq": [
                "type:File",
                "source_id:\"MRI-ESM2-0\"",
                "member_id:\"r1i1p1f1\"",
                "frequency:\"mon\"",
                "variable_id:\"siconca\"",
                "table_id:\"SImon\"",
                "grid_label:\"gn\"",
                "experiment_id:\"ssp245\"",
                "data_node:\"esgf-data2.diasjp.net\"",
                "project:\"CMIP6\"",
                "latest:true"
            ],
            "sort": "id asc",
            "rows": "10",
            "q": "*:*",
            "shards": "localhost:8983/solr/files,localhost:8985/solr/files,localhost:8988/solr/files,localhost:8990/solr/files,localhost:8993/solr/files,localhost:8995/solr/files,localhost:8987/solr/files",
            "tie": "0.01",
            "facet.limit": "2048",
            "qf": "text",
            "facet.method": "fc",
            "facet.mincount": "1",
            "wt": "json",
            "facet.sort": "lex"
        }
    },
    "response": {
        "numFound": 0,
        "start": 0,
        "maxScore": 0.0,
        "docs": []
    }
}

p.s. excuse the silly internet pseudonym; this is Peter, we met at AI UK...
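
For anyone wanting to repeat this check, the query above can be rebuilt and its response inspected with the Python standard library. This is a sketch only: the parameter values are copied from the URL in this issue, the function names are hypothetical, and the actual HTTP request is left as a comment because the search service may not be reachable:

```python
import json
from urllib.parse import urlencode

ESGF_SEARCH = "https://esgf-node.llnl.gov/esg-search/search/"


def build_query(source_id, member_id, variable_id, table_id,
                experiment_id, data_node):
    """Build a search URL with the same parameters as the request above."""
    params = {
        "source_id": source_id,
        "member_id": member_id,
        "frequency": "mon",
        "variable_id": variable_id,
        "table_id": table_id,
        "grid_label": "gn",
        "experiment_id": experiment_id,
        "data_node": data_node,
        "project": "CMIP6",
        "type": "File",
        "latest": "true",
        "format": "application/solr+json",
        "offset": 0,
    }
    return ESGF_SEARCH + "?" + urlencode(params)


def num_found(response_text):
    """Return the number of matching files reported in a search response."""
    return json.loads(response_text)["response"]["numFound"]


url = build_query("MRI-ESM2-0", "r1i1p1f1", "siconca", "SImon",
                  "ssp245", "esgf-data2.diasjp.net")
print(url)

# To send the request (requires network access):
# from urllib.request import urlopen
# print(num_found(urlopen(url).read().decode()))
```

A numFound of 0, as in the response above, means the search returned no files for that combination of parameters.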

Reproducing the forecasts

Hello,
If I want to reproduce the forecasts shown in the paper, what are my options? Are the weights available anywhere for me to use, or do I have to train the model locally?

./download_seas5_forecasts_in_parallel.sh: latlon folder is empty

Hello,
After checking the logs and the data folder, I found that one of the logs reads:

Regridding to EASE... ut_scale(): NULL factor argument
ut_are_convertible(): NULL unit argument
/scratch/zv32/cd8380/mambaforge/envs/icenet/lib/python3.7/site-packages/iris/fileformats/_pyke_rules/compiled_krb/fc_rules_cf_fc.py:2190: UserWarning: Ignoring netCDF variable 'siconc' invalid units '(0 - 1)'
  warnings.warn(msg)
Deleting existing file.
Done
Done.

The latlon folder is also empty. I ran the program on a cloud server, which reported that it finished successfully, but it seems that something is wrong.

Missing file sea_ice_outlook_errors.csv

Traceback (most recent call last):
  File "icenet/plot_paper_figures.py", line 113, in <module>
    sie_errors_df = pd.read_csv(os.path.join(config.data_folder, 'sea_ice_outlook_errors.csv'), comment='#')
FileNotFoundError: [Errno 2] No such file or directory: 'data/sea_ice_outlook_errors.csv'

Hello @tom-andersson and @JimCircadian,
When I run the final plotting program, it fails with an error about a missing file.
I searched the project code but could not find any code that generates or downloads sea_ice_outlook_errors.csv.

I have followed the steps in the README. To save time I skipped the permute-and-predict step, but that should not affect the download or generation of sea_ice_outlook_errors.csv. All the previous steps ran without any error.

How is this file downloaded or generated?
Or did I miss a step or do something wrong?

Thank you

Why does python3 icenet/download_cmip6_data.py need so much memory to run?

Hello, I have run into another issue: when I run download_cmip6_data_in_parallel.sh, Ubuntu kills all 5 EC-Earth3 download commands.
So I tried running only the first command of download_cmip6_data_in_parallel.sh to see what happens:
python3 icenet/download_cmip6_data.py --source_id EC-Earth3 --member_id r2i1p1f1 > logs/cmip6_download_logs/EC_r2i1p1f1.txt 2>&1 &

or, equivalently, running download_cmip6_data.py directly with its defaults (EC-Earth3, r2i1p1f1).
When I ran this program, my memory usage went up to 80% (I had terminated almost all other processes).
Unfortunately, the process was still killed by Ubuntu when it reached the "Generating video" stage.

My laptop has 2 x 8 = 16 GB of memory, which still ran out. Does this mean the project can only be run on a local machine with more memory (32 GB or more?), or only on a cloud server such as AWS or Azure?

Question about models.py

Hi,
Can the model in icenet/models.py be considered a spatio-temporal prediction model?

Can't download data when running .sh files

Hello Mr. Andersson,

Could you please help me figure out why this problem happens when I run the IceNet project code? Thank you.

I tried to run the following commands:

./download_era5_data_in_parallel.sh

./download_cmip6_data_in_parallel.sh

./rotate_wind_data_in_parallel.sh

./download_seas5_forecasts_in_parallel.sh

The first three .sh files finish in about 1 second, and the last one, download_seas5_forecasts_in_parallel.sh, runs like this:

[Image 1]

I am still waiting for ECMWF to approve my data access application, but I have already set up a CDS account and populated my .cdsapirc file according to the guidance.
It seems that ./download_era5_data_in_parallel.sh and ./download_cmip6_data_in_parallel.sh do not download anything.

I am confused about why these two .sh files do not work.

Then, when I execute python3 icenet/biascorrect_seas5_forecasts.py:
[Image 2]

I know this is because I do not yet have access to the ECMWF data, but it seems that I cannot download the ERA5 and CMIP6 data either.

gen_masks.py and download_sic_data.py work and download data correctly.

Thank you for your help
