ClimSim: An open large-scale dataset for training high-resolution physics emulators in hybrid multi-scale climate simulators

This repository contains the code necessary to download and preprocess the data, and create, train, and evaluate the baseline models in the paper.

Dataset Information

Data from multi-scale climate model (E3SM-MMF) simulations were saved at 20-minute intervals for 10 simulated years. Two netCDF files, input and output (target), are produced at each timestep, totaling 525,600 files for each configuration. Three configurations of E3SM-MMF were run:

High-Resolution Real Geography
- 1.5° x 1.5° horizontal resolution (21,600 grid columns)
- 5.7 billion total samples (41.2 TB)
- 102 MB per input file, 61 MB per output file
Low-Resolution Real Geography
- 11.5° x 11.5° horizontal resolution (384 grid columns)
- 100 million total samples (744 GB)
- 1.9 MB per input file, 1.1 MB per output file
Low-Resolution Aquaplanet
- 11.5° x 11.5° horizontal resolution (384 grid columns)
- 100 million total samples (744 GB)
- 1.9 MB per input file, 1.1 MB per output file
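
The file and sample counts quoted above can be cross-checked with a little arithmetic. The sketch below assumes a 365-day model calendar with no leap years and treats one "sample" as one grid column at one timestep; neither assumption is stated explicitly above, but both reproduce the quoted figures.

```python
# Cross-check of the dataset sizes quoted above.
# Assumptions: 365-day years (no leap days); one sample = one grid column at one timestep.
STEP_MINUTES = 20
DAYS_PER_YEAR = 365
YEARS = 10
FILES_PER_STEP = 2  # one input file and one target file

steps_per_day = (24 * 60) // STEP_MINUTES
total_steps = steps_per_day * DAYS_PER_YEAR * YEARS
total_files = total_steps * FILES_PER_STEP

high_res_samples = 21_600 * total_steps  # high-resolution grid columns
low_res_samples = 384 * total_steps      # low-resolution grid columns

print(steps_per_day, total_files)         # 72 525600
print(high_res_samples, low_res_samples)  # 5676480000 100915200
```

These values round to the 5.7 billion and 100 million samples listed for the high- and low-resolution configurations.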

Scalar variables vary in time and horizontal space ("ncol"), while vertically-resolved variables vary additionally in vertical space ("lev"). The full list of variables can be found in Supplementary Information Table 1. The subset of variables used in our experiments is shown below:

Input	Target	Variable	Description	Units	Dimensions
X		T	Air temperature	K	(lev, ncol)
X		q	Specific humidity	kg/kg	(lev, ncol)
X		PS	Surface pressure	Pa	(ncol)
X		SOLIN	Solar insolation	W/m²	(ncol)
X		LHFLX	Surface latent heat flux	W/m²	(ncol)
X		SHFLX	Surface sensible heat flux	W/m²	(ncol)
	X	dT/dt	Heating tendency	K/s	(lev, ncol)
	X	dq/dt	Moistening tendency	kg/kg/s	(lev, ncol)
	X	NETSW	Net surface shortwave flux	W/m²	(ncol)
	X	FLWDS	Downward surface longwave flux	W/m²	(ncol)
	X	PRECSC	Snow rate	m/s	(ncol)
	X	PRECC	Rain rate	m/s	(ncol)
	X	SOLS	Visible direct solar flux	W/m²	(ncol)
	X	SOLL	Near-IR direct solar flux	W/m²	(ncol)
	X	SOLSD	Visible diffuse solar flux	W/m²	(ncol)
	X	SOLLD	Near-IR diffuse solar flux	W/m²	(ncol)
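
For models that consume one flattened vector per grid column, the table above fixes the input and target lengths once the number of vertical levels is known. The sketch below counts them. The value of 60 levels is an assumption used for illustration and should be read from the "lev" dimension of the files; the helper name vector_length is hypothetical.

```python
# Variables from the table above; True marks vertically-resolved (lev, ncol) variables.
INPUT_VARS = {"T": True, "q": True, "PS": False, "SOLIN": False,
              "LHFLX": False, "SHFLX": False}
TARGET_VARS = {"dT/dt": True, "dq/dt": True, "NETSW": False, "FLWDS": False,
               "PRECSC": False, "PRECC": False, "SOLS": False, "SOLL": False,
               "SOLSD": False, "SOLLD": False}

def vector_length(variables, n_lev):
    """Length of one flattened per-column vector: n_lev per profile, 1 per scalar."""
    return sum(n_lev if is_profile else 1 for is_profile in variables.values())

N_LEV = 60  # assumption: check against the "lev" dimension in the netCDF files
print(vector_length(INPUT_VARS, N_LEV), vector_length(TARGET_VARS, N_LEV))  # 124 128
```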

Download the Data

The input ("mli") and target ("mlo") data for all E3SM-MMF configurations can be downloaded from Hugging Face.

Preprocess the Data

The default preprocessing workflow takes folders of monthly data from the climate model simulations, and creates normalized NumPy arrays for input and target data for training, validation, and scoring. These NumPy arrays are called train_input.npy, train_target.npy, val_input.npy, val_target.npy, scoring_input.npy, and scoring_target.npy. An option to strictly use a data loader and avoid converting into NumPy arrays is available in data_utils.py; however, this can slow down training because of increased I/O.

The data comes in folders labeled YYYY-MM, where YYYY is the simulation year and MM is the month. Within each of these folders are netCDF (.nc) files that hold the inputs and targets for individual timesteps. Input files are labeled E3SM-MMF.mli.YYYY-MM-DD-SSSSS.nc, where DD is the day of the month and SSSSS is the seconds of the day, with timesteps spaced 1,200 seconds (20 minutes) apart. Target files are labeled the same way, except that mli is replaced by mlo. For vertically-resolved variables, lower indices correspond to higher levels in the atmosphere: levels are ordered by increasing pressure, and pressure decreases monotonically with altitude.
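
The naming scheme described above can be sketched in a few lines. The helper names below are hypothetical and are not part of data_utils.py, and the assumption that each day starts at 00000 seconds is not confirmed by the text above.

```python
import re

STEP_SECONDS = 1200      # 20-minute spacing between timesteps
SECONDS_PER_DAY = 86400

def input_filename(year, month, day, seconds):
    """Build an input ("mli") file name for one timestep."""
    return f"E3SM-MMF.mli.{year:04d}-{month:02d}-{day:02d}-{seconds:05d}.nc"

def target_filename(input_name):
    """Target files use the same name with mli replaced by mlo."""
    return input_name.replace(".mli.", ".mlo.")

def folder_for(file_name):
    """Return the YYYY-MM folder that a file name belongs to."""
    match = re.search(r"\.ml[io]\.(\d{4}-\d{2})-\d{2}-\d{5}\.nc$", file_name)
    if match is None:
        raise ValueError(f"unexpected file name: {file_name}")
    return match.group(1)

# Assumption: the first timestep of each day is at 00000 seconds.
one_day = [input_filename(1, 2, 1, s)
           for s in range(0, SECONDS_PER_DAY, STEP_SECONDS)]
print(len(one_day), one_day[0], target_filename(one_day[1]), folder_for(one_day[0]))
```

Because every field is zero-padded, sorting these names alphabetically also sorts them chronologically.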

The files containing the default normalization factors for the input and target data are found in the norm_factors/ folder, precomputed for convenience. However, one can use their own normalization factors if desired. The file containing the E3SM-MMF grid information is found in the grid_info/ folder. This corresponds to the netCDF file ending in grid-info.nc on Hugging Face.

The environment needed for preprocessing can be found in the /preprocessing/env/requirements.txt file. A class designed for preprocessing and metrics can be imported from the data_utils.py script. This script is used in the preprocessing/create_npy_data_splits.ipynb notebook, which creates training, validation, and scoring datasets.

By default, training and validation data subsample every 7th timestep, while scoring data subsamples every 6th timestep to enable daily-averaged metrics. Training data is taken from the second month of simulation year 1 through the first month of simulation year 8 (i.e., 0001-02 through 0008-01). Both validation and scoring data are taken from 0008-02 through 0009-01. However, data_utils.py lets the user change these defaults, assuming some knowledge of regular expressions. To see how this works, please reference preprocessing/create_npy_data_splits.ipynb.
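
As a rough illustration of how regular expressions and a stride can express these defaults, the sketch below selects month folders and subsamples a sorted file list. It is not the data_utils.py interface: the patterns, the select helper, and the "YYYY-MM/filename" relative-path layout are assumptions made for this example.

```python
import re

# Training months: 0001-02 through 0008-01.
TRAIN = re.compile(r"^(0001-(0[2-9]|1[0-2])|000[2-7]-(0[1-9]|1[0-2])|0008-01)$")
# Validation and scoring months: 0008-02 through 0009-01.
HELD_OUT = re.compile(r"^(0008-(0[2-9]|1[0-2])|0009-01)$")

def select(files, folder_pattern, stride):
    """Keep files whose YYYY-MM folder matches, then take every stride-th file."""
    kept = sorted(f for f in files if folder_pattern.match(f.split("/")[0]))
    return kept[::stride]

# Check the month ranges against all folders of a 10-year run.
all_months = [f"{y:04d}-{m:02d}" for y in range(1, 11) for m in range(1, 13)]
train_months = [m for m in all_months if TRAIN.match(m)]
held_out_months = [m for m in all_months if HELD_OUT.match(m)]
print(len(train_months), train_months[0], train_months[-1])           # 84 0001-02 0008-01
print(len(held_out_months), held_out_months[0], held_out_months[-1])  # 12 0008-02 0009-01

# One day of hypothetical input files, subsampled with the training stride.
files = [f"0001-02/E3SM-MMF.mli.0001-02-01-{s:05d}.nc" for s in range(0, 86400, 1200)]
print(len(select(files, TRAIN, 7)), len(select(files, HELD_OUT, 6)))  # 11 0
```

A stride of 6 divides the 72 daily timesteps evenly, so every day contributes the same 12 times of day, which is convenient for daily averages.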

Baseline Models

Six different baseline models were created and trained:

Convolutional neural network (CNN)
Encoder-decoder (ED)
Heteroskedastic regression (HSR)
Multi-layer perceptron (MLP)
Randomized prior network (RPN)
Conditional variational autoencoder (cVAE)

Jupyter Notebooks describing how to load and train simple CNN and MLP models are found in the demo_notebooks/ folder. The environments and code used to train each model, as well as the pre-trained models, are found in the baseline_models/ folder.

Evaluation

Four different evaluation metrics were calculated:

Mean absolute error (MAE)
Coefficient of determination (R²)
Root mean squared error (RMSE)
Continuous ranked probability score (CRPS)
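
For reference, the four metrics listed above can be written out as follows. These are the textbook definitions in plain Python, with the CRPS given in its common finite-ensemble form; they are a minimal sketch and not the repository's implementation, which may aggregate over space, levels, and time differently.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def crps_ensemble(y, members):
    """CRPS of a finite ensemble for one scalar truth y:
    mean |x_i - y| - 0.5 * mean |x_i - x_j|."""
    n = len(members)
    spread_to_truth = sum(abs(m - y) for m in members) / n
    spread_within = sum(abs(a - b) for a in members for b in members) / (2 * n * n)
    return spread_to_truth - spread_within

y_true, y_pred = [1.0, 2.0, 3.0], [1.0, 2.0, 5.0]
print(mae(y_true, y_pred), rmse(y_true, y_pred), r2(y_true, y_pred))
print(crps_ensemble(1.0, [0.0, 2.0]))  # 0.5
```

In this toy example R² is -1: a negative value means the prediction has a larger squared error than always predicting the mean of the truth. For a single-member ensemble, the CRPS reduces to the absolute error.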

Evaluation and comparison of the different baseline models are found in the metrics_and_figures/ folder. All variables are converted to a common energy unit (i.e., W/m²) for scoring. The scoring is done using the functions in metrics_and_figures/data_utils.py.
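
To give a sense of what a conversion to W/m² involves, one standard approach for a layer heating tendency is to weight it by the layer's air mass per unit area (dp/g) and by the specific heat of air. The sketch below illustrates that idea only. The function name and the constant values are assumptions, and the conversion actually used for scoring is the one in metrics_and_figures/data_utils.py.

```python
CP_AIR = 1004.0  # J/(kg K); approximate specific heat of dry air (assumed value)
GRAVITY = 9.81   # m/s^2 (assumed value)

def heating_to_flux(dT_dt, dp):
    """Convert a layer heating tendency to an energy flux.

    dT_dt: heating tendency of the layer in K/s
    dp:    pressure thickness of the layer in Pa
    Returns W/m^2, since (J/(kg K)) * (kg/m^2) * (K/s) = W/m^2.
    """
    mass_per_area = dp / GRAVITY  # kg/m^2 of air in the layer
    return CP_AIR * mass_per_area * dT_dt

# A 1000 Pa thick layer warming at 1e-5 K/s (roughly 0.86 K/day).
print(round(heating_to_flux(1e-5, 1000.0), 3))  # 1.023
```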

Evaluation metrics are computed separately for each horizontally-averaged, vertically-averaged, and time-averaged target variable. The performance of each baseline model on all four metrics is shown below:

MAE (W/m²)	CNN	ED	HSR	MLP	RPN	cVAE
dT/dt	2.585	2.684	2.845	2.683	2.685	2.732
dq/dt	4.401	4.673	4.784	4.495	4.592	4.680
NETSW	18.85	14.968	19.82	13.36	18.88	19.73
FLWDS	8.598	6.894	6.267	5.224	6.018	6.588
PRECSC	3.364	3.046	3.511	2.684	3.328	3.322
PRECC	37.83	37.250	42.38	34.33	37.46	38.81
SOLS	10.83	8.554	11.31	7.97	10.36	10.94
SOLL	13.15	10.924	13.60	10.30	12.96	13.46
SOLSD	5.817	5.075	6.331	4.533	5.846	6.159
SOLLD	5.679	5.136	6.215	4.806	5.702	6.066

R²	CNN	ED	HSR	MLP	RPN	cVAE
dT/dt	0.627	0.542	0.568	0.589	0.617	0.590
dq/dt	--	--	--	--	--	--
NETSW	0.944	0.980	0.959	0.983	0.968	0.957
FLWDS	0.828	0.802	0.904	0.924	0.912	0.883
PRECSC	--	--	--	--	--	--
PRECC	0.077	-17.909	-68.35	-38.69	-67.94	-0.926
SOLS	0.927	0.960	0.929	0.961	0.943	0.929
SOLL	0.916	0.945	0.916	0.948	0.928	0.915
SOLSD	0.927	0.951	0.923	0.956	0.940	0.921
SOLLD	0.813	0.857	0.797	0.866	0.837	0.796

RMSE (W/m²)	CNN	ED	HSR	MLP	RPN	cVAE
dT/dt	4.369	4.696	4.825	4.421	4.482	4.721
dq/dt	7.284	7.643	7.896	7.322	7.518	7.780
NETSW	36.91	28.537	37.77	26.71	33.60	38.36
FLWDS	10.86	9.070	8.220	6.969	7.914	8.530
PRECSC	6.001	5.078	6.095	4.734	5.511	6.182
PRECC	85.31	76.682	90.64	72.88	76.58	88.71
SOLS	22.92	17.999	23.61	17.40	20.61	23.27
SOLL	27.25	22.540	27.78	21.95	25.22	27.81
SOLSD	12.13	9.917	12.40	9.420	11.00	12.64
SOLLD	12.10	10.417	12.47	10.12	11.25	12.63

CRPS (W/m²)	CNN	ED	HSR	MLP	RPN	cVAE
dT/dt	--	--	3.284	--	2.580	2.795
dq/dt	--	--	4.899	--	4.022	4.372
NETSW	--	--	0.055	--	0.053	0.057
FLWDS	--	--	0.018	--	0.016	0.018
PRECSC	--	--	0.011	--	0.008	0.009
PRECC	--	--	0.122	--	0.085	0.097
SOLS	--	--	0.031	--	0.028	0.033
SOLL	--	--	0.038	--	0.035	0.040
SOLSD	--	--	0.018	--	0.015	0.016
SOLLD	--	--	0.017	--	0.015	0.016

The metrics_and_figures/ClimSim_metrics.ipynb notebook and the metrics_and_figures/crps_clean.py script calculate and plot MAE, R², RMSE, and CRPS scores for each baseline model. The separate R² for longitudinally-averaged and time-averaged 3D variables is found in plot_R2_analysis.ipynb.
