FourCastNet's Issues

Help editing code to run train.py on smaller h5 files

Thank you for releasing this amazing repo!

I found the reason for the error - it has to do with the number of in_channels specified in the AFNO.yaml file.

I changed in_channels to [0, 1, 2] and I don't get the error now

I'm closing this issue, but it'd be great if someone could give me some insight or add a short note on the changes needed to train the model on smaller h5 files. A quick sketch of the problem is below.
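
In case it helps others, here's a minimal sketch of why the error occurs (the file name is hypothetical): the subsampled files only have 3 channels, so any entry of in_channels beyond index 2 pushes the fancy index out of range.

import h5py

with h5py.File('era5_subsample_2016.h5', 'r') as f:  # hypothetical file name
    ds = f['fields']           # shape (1460, 3, 360, 360) for the NERSC subsample
    ok = ds[0:2, [0, 1, 2]]    # fine: channel indices 0-2 exist
    # ds[0:2, [0, 1, 2, 3]]    # IndexError: Fancy indexing out of range for (0-2)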

Thank you again for releasing the repo - I look forward to understanding the code better :)

==================

Original issue

I'm able to run train.py with the large h5 files available on Globus.

When I try to run train.py with the smaller h5 files (regional or era5_subsample) made available on the NERSC portal, the following line throws an error:

https://github.com/NVlabs/FourCastNet/blob/master/utils/data_loader_multifiles.py#L207

specifically:

self.files[year_idx][(local_idx-self.dt*self.n_history):(local_idx+1):self.dt, self.in_channels] throws the following error

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/home/user/anaconda3/envs/climax/lib/python3.8/site-packages/h5py/_hl/dataset.py", line 710, in __getitem__
    return self._fast_reader.read(args)
  File "h5py/_selector.pyx", line 351, in h5py._selector.Reader.read
  File "h5py/_selector.pyx", line 198, in h5py._selector.Selector.apply_args
IndexError: Fancy indexing out of range for (0-2)

The shape of self.files[year_idx] for the larger h5 files on Globus is:

HDF5 dataset: shape (1460, 21, 721, 1440)

The shape of self.files[year_idx] for the smaller h5 files on NERSC (regional or era5_subsample) is:

HDF5 dataset: shape (1460, 3, 360, 360)

I'm not very familiar with h5py yet - could someone please help me edit the code at https://github.com/NVlabs/FourCastNet/blob/master/utils/data_loader_multifiles.py#L207 so I can run train.py on the smaller h5 files?

Is there some edit to AFNO.yaml, besides the file paths, that needs to be made to train the model on smaller h5 files?

thank you! @jdppthk @MortezaMardani

Pre-processing parallel_copy.py

Thank you for your great code; this is a SOTA model.
I had an issue running the pre-processing script parallel_copy.py (or MPI.py, which is similar to parallel_copy.py but handles a different number of years) on the exact datasets for the full years (2016-2021), and I still get this error: ValueError: h5py was built without MPI support, can't use mpio driver

I installed OpenMPI and mpi4py:

(cast) mg@amru-System-Product-Name:~$ mpiexec -n 5 python -m mpi4py.bench helloworld
Hello, World! I am process 0 of 5 on amru-System-Product-Name.
Hello, World! I am process 1 of 5 on amru-System-Product-Name.
Hello, World! I am process 2 of 5 on amru-System-Product-Name.
Hello, World! I am process 3 of 5 on amru-System-Product-Name.
Hello, World! I am process 4 of 5 on amru-System-Product-Name.

I don't know what causes this problem because, from my point of view, everything should be fine with the code and datasets.

(cast) mg@amru-System-Product-Name:~/Documents/Data$ mpirun -n 4 python MPI.py 
{2016: 'j', 2017: 'j', 2018: 'k', 2019: 'k', 2020: 'a', 2021: 'a'}
2016
{2016: 'j', 2017: 'j', 2018: 'k', 2019: 'k', 2020: 'a', 2021: 'a'}
2016
==============================
rank 1
Nproc 4
==============================
Nimgtot 1460
Nproc 4
Nimg 365
Traceback (most recent call last):
  File "MPI.py", line 130, in <module>
    with h5py.File(f'{str(year)}.h5', 'w') as f:
  File "/home/mg/.local/lib/python3.8/site-packages/h5py/_hl/files.py", line 442, in __init__
    fid = make_fid(name, mode, userblock_size,
  File "/home/mg/.local/lib/python3.8/site-packages/h5py/_hl/files.py", line 201, in make_fid
    fid = h5f.create(name, h5f.ACC_TRUNC, fapl=fapl, fcpl=fcpl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 116, in h5py.h5f.create
BlockingIOError: [Errno 11] Unable to create file (unable to lock file, errno = 11, error message = 'Resource temporarily unavailable')
{2016: 'j', 2017: 'j', 2018: 'k', 2019: 'k', 2020: 'a', 2021: 'a'}
2016
{2016: 'j', 2017: 'j', 2018: 'k', 2019: 'k', 2020: 'a', 2021: 'a'}
2016
Traceback (most recent call last):
  File "MPI.py", line 133, in <module>
    writetofile(src, dest, 0, ['u10'])
  File "MPI.py", line 75, in writetofile
    fdest = h5py.File(dest, 'a', driver='mpio', comm=MPI.COMM_WORLD)
  File "/home/mg/.local/lib/python3.8/site-packages/h5py/_hl/files.py", line 441, in __init__
    fapl = make_fapl(driver, libver, rdcc_nslots, rdcc_nbytes, rdcc_w0, **kwds)
  File "/home/mg/.local/lib/python3.8/site-packages/h5py/_hl/files.py", line 144, in make_fapl
    set_fapl(plist, **kwds)
  File "/home/mg/.local/lib/python3.8/site-packages/h5py/_hl/files.py", line 48, in _set_fapl_mpio
    raise ValueError("h5py was built without MPI support, can't use mpio driver")
ValueError: h5py was built without MPI support, can't use mpio driver
Traceback (most recent call last):
  File "MPI.py", line 130, in <module>
    with h5py.File(f'{str(year)}.h5', 'w') as f:
  File "/home/mg/.local/lib/python3.8/site-packages/h5py/_hl/files.py", line 442, in __init__
    fid = make_fid(name, mode, userblock_size,
  File "/home/mg/.local/lib/python3.8/site-packages/h5py/_hl/files.py", line 201, in make_fid
    fid = h5f.create(name, h5f.ACC_TRUNC, fapl=fapl, fcpl=fcpl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 116, in h5py.h5f.create
BlockingIOError: [Errno 11] Unable to create file (unable to lock file, errno = 11, error message = 'Resource temporarily unavailable')
==============================
rank 2
Nproc 4
==============================
Nimgtot 1460
Nproc 4
Nimg 365
Traceback (most recent call last):
  File "MPI.py", line 133, in <module>
    writetofile(src, dest, 0, ['u10'])
  File "MPI.py", line 75, in writetofile
    fdest = h5py.File(dest, 'a', driver='mpio', comm=MPI.COMM_WORLD)
  File "/home/mg/.local/lib/python3.8/site-packages/h5py/_hl/files.py", line 441, in __init__
    fapl = make_fapl(driver, libver, rdcc_nslots, rdcc_nbytes, rdcc_w0, **kwds)
  File "/home/mg/.local/lib/python3.8/site-packages/h5py/_hl/files.py", line 144, in make_fapl
    set_fapl(plist, **kwds)
  File "/home/mg/.local/lib/python3.8/site-packages/h5py/_hl/files.py", line 48, in _set_fapl_mpio
    raise ValueError("h5py was built without MPI support, can't use mpio driver")
ValueError: h5py was built without MPI support, can't use mpio driver
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[3210,1],0]
  Exit code:    1

The datasets were downloaded from cds.climate.copernicus.eu for each year, with 20 parameters.

I just started using mpi4py and h5py - could you please help me run the pre-processing script parallel_copy.py?
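
One likely culprit (an assumption on my part, not confirmed from the repo): the prebuilt h5py wheels from PyPI ship without parallel HDF5, so the mpio driver is unavailable regardless of how MPI itself is installed. A quick check:

import h5py

# True only if h5py was built against an MPI-enabled HDF5
print(h5py.get_config().mpi)

If this prints False, the h5py documentation suggests rebuilding from source against a parallel HDF5 build, e.g. HDF5_MPI="ON" CC=mpicc pip install --no-binary=h5py h5py.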

Minor fixes

  • Check license header for afnonet.py
  • Create a versioning and model weights table
  • Center the logos at the top of the wiki

Pre-processing stage key error

First of all, thank you for your GREAT code; it is a real game-changer.
I had an issue running the pre-processing stage on the exact datasets used in the example code for this stage (13 days), and I still get this error: KeyError: "Unable to open object (object 'fields' doesn't exist)"

I don't know what causes this problem because, from my point of view, everything should be fine with the code and datasets.
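
A quick diagnostic (just a sketch; the file name is hypothetical) is to list the top-level keys of the HDF5 file the pre-processing script reads, to confirm whether a 'fields' dataset was actually written:

import h5py

with h5py.File('2016.h5', 'r') as f:  # hypothetical file name
    print(list(f.keys()))             # should include 'fields'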

problem with downloading pre-trained model

I do not have an organization account, so I log into Globus through a personal identity.
I ran into a network issue with my personal endpoint. I wonder if there is another place, Google Drive for example, to download the pretrained model file?

Trouble with downloading data

Hello, thanks for open-sourcing such well-structured code!

I am currently having trouble downloading the pre-processed data shared on Globus.

Even after logging in, it seems like the download button is disabled. Has anything changed with the permissions? What can I do to fix this? Below is a screenshot of what I am facing.

[Screenshot: the download button in the Globus web app appears disabled]

cfgrib.dataset.DatasetBuildError

When I try to read the .grib file, the output image is displayed with an error:

cfgrib.dataset.DatasetBuildError: key present and new value is different: key='surface' value=Variable(dimensions=(), data=0.0) new_value=Variable(dimensions=(), data=2.0)

Also, how can I solve the problem that the predicted result differs greatly from the actual value?
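
For the DatasetBuildError itself, a common workaround (a sketch, assuming the GRIB file mixes variables with conflicting coordinate values; the file name is hypothetical) is to open each consistent subset separately using cfgrib's filter_by_keys:

import xarray as xr

# Open only the surface-level variables; other subsets can be opened
# with different filter_by_keys selections.
ds = xr.open_dataset('output.grib', engine='cfgrib',
                     backend_kwargs={'filter_by_keys': {'typeOfLevel': 'surface'}})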

Regarding the pre-trained weight file backbone.ckpt

Regarding the pre-trained weight file backbone.ckpt: is this the weight file from the afno_backbone stage or from the afno_backbone_finetune stage? I would be extremely grateful for a prompt reply.

Possibility of using different resolution input data over smaller areas

I have a very general question about this method that is not clear to me. Sorry if it is very basic - I am a newbie in this field.

I ran the code and got results for the same datasets used in the paper (ERA5, 30 km resolution) for different dates.
Can I use higher-resolution data (2 km) over a smaller area too? And would I need to train the whole model from scratch, or can I use the checkpoints provided with the paper to run the code on higher-resolution datasets?
Thank you very much for your time! :)

Checklist before open sourcing

  • Create universally shared Globus directories for model weights, stats and data
  • Add ECMWF data license
  • Create staging directory for checkpoints
  • Rewrite training readme
  • Rewrite inference readme
  • Test inference workflow following readme
  • Remove mins and maxs from stats, training and inference workflow

How to understand the core code in FFT?

Hello! I can't understand the calculation method in FFT.
o1_real = x.real * w1[0] - x.imag * w1[1] + b1[0]
o1_imag = x.imag * w1[0] + x.real * w1[1] + b1[1]
o2_real = o1_real * w2[0] - o1_imag * w2[1] + b2[0]
o2_imag = o1_imag * w2[0] + o1_real * w2[1] + b2[1]
Why should the real part use '-', '+', '+' while the imaginary part uses '+', '+', '+'? What role does this combination of signs play? Can I change the first '-' to '+' or vice versa?
And what role does 'kept_modes' play? Is it a form of dropout?
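
A note on the signs (standard complex arithmetic, not something specific to this repo): each pair of lines multiplies complex numbers stored as separate real and imaginary parts, so the signs follow directly from (a + bi)(c + di) = (ac - bd) + (ad + bc)i and cannot be flipped without changing the operation. A minimal sketch:

import torch

# x = a + bi, w = c + di; the paired update lines implement (a + bi)(c + di)
a, b = torch.randn(4), torch.randn(4)  # x.real, x.imag
c, d = torch.randn(4), torch.randn(4)  # w[0], w[1]
out_real = a * c - b * d               # real part: ac - bd
out_imag = b * c + a * d               # imaginary part: bc + ad
ref = torch.complex(a, b) * torch.complex(c, d)
assert torch.allclose(out_real, ref.real) and torch.allclose(out_imag, ref.imag)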

TP accuracy

Hello everyone. I have an issue with the TP ACC, which is extremely low. Does anyone know what the problem could be? Below you can find the output of an inference run and my input data info:

input h5 shape: tp(120, 721, 1440).

inference:
2023-06-25 11:24:06,620 - root - INFO - Timestep 0 of 20. TP RMS Error: 0.0, ACC: 1.0
2023-06-25 11:24:09,065 - root - INFO - Timestep 1 of 20. TP RMS Error: 0.0016368781216442585, ACC: 0.35278499126434326
2023-06-25 11:24:09,388 - root - INFO - Timestep 2 of 20. TP RMS Error: 0.00156010827049613, ACC: 0.3554094731807709
2023-06-25 11:24:09,688 - root - INFO - Timestep 3 of 20. TP RMS Error: 0.0015420995187014341, ACC: 0.3688367009162903
2023-06-25 11:24:09,988 - root - INFO - Timestep 4 of 20. TP RMS Error: 0.0014832873130217195, ACC: 0.3889610171318054
2023-06-25 11:24:10,289 - root - INFO - Timestep 5 of 20. TP RMS Error: 0.0014608813216909766, ACC: 0.36930468678474426
2023-06-25 11:24:10,589 - root - INFO - Timestep 6 of 20. TP RMS Error: 0.0015008833725005388, ACC: 0.32038554549217224
2023-06-25 11:24:10,889 - root - INFO - Timestep 7 of 20. TP RMS Error: 0.0013966681435704231, ACC: 0.36792299151420593
2023-06-25 11:24:11,189 - root - INFO - Timestep 8 of 20. TP RMS Error: 0.001374703599140048, ACC: 0.3730277121067047
2023-06-25 11:24:11,490 - root - INFO - Timestep 9 of 20. TP RMS Error: 0.0013704199809581041, ACC: 0.33114054799079895
2023-06-25 11:24:11,790 - root - INFO - Timestep 10 of 20. TP RMS Error: 0.0014247711515054107, ACC: 0.2770615518093109
2023-06-25 11:24:12,090 - root - INFO - Timestep 11 of 20. TP RMS Error: 0.001329558901488781, ACC: 0.3315066695213318
2023-06-25 11:24:12,390 - root - INFO - Timestep 12 of 20. TP RMS Error: 0.0012841359712183475, ACC: 0.35082289576530457
2023-06-25 11:24:12,690 - root - INFO - Timestep 13 of 20. TP RMS Error: 0.0012776957591995597, ACC: 0.32422271370887756
2023-06-25 11:24:12,990 - root - INFO - Timestep 14 of 20. TP RMS Error: 0.001365577569231391, ACC: 0.2353413850069046
2023-06-25 11:24:13,290 - root - INFO - Timestep 15 of 20. TP RMS Error: 0.001326375175267458, ACC: 0.28706997632980347
2023-06-25 11:24:13,590 - root - INFO - Timestep 16 of 20. TP RMS Error: 0.0013120684307068586, ACC: 0.3122824430465698
2023-06-25 11:24:13,890 - root - INFO - Timestep 17 of 20. TP RMS Error: 0.001352619961835444, ACC: 0.270219087600708
2023-06-25 11:24:14,190 - root - INFO - Timestep 18 of 20. TP RMS Error: 0.0014553270302712917, ACC: 0.1826096624135971
2023-06-25 11:24:14,489 - root - INFO - Timestep 19 of 20. TP RMS Error: 0.0014066090807318687, ACC: 0.23088878393173218
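
For context (my own note, not from the repo): ACC here is the anomaly correlation coefficient - the correlation between predicted and observed anomalies relative to a climatology - and precipitation typically scores much lower on it than smooth fields like z500. A minimal unweighted sketch (the repo's metric is latitude-weighted):

import numpy as np

def acc(pred, truth, clim):
    # anomalies relative to the climatological mean
    fa, oa = pred - clim, truth - clim
    return (fa * oa).sum() / np.sqrt((fa ** 2).sum() * (oa ** 2).sum())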

Minor typo in README

The training section refers to National Energy Resarch Scientific Computing Center (NERSC).

I believe it should be "Research".

Unable to download the weights

Hi,

I was looking to perform inference with the trained model and noticed that the download option was not available in the Globus app. Could someone please look into this?

Thanks,
Vignesh

The way `time_means` is calculated in the get_stats.py script is wrong

Please see: https://github.com/NVlabs/FourCastNet/blob/master/data_process/get_stats.py

time_means = np.zeros((1, 21, 721, 1440))  # <-- initialized to zero

for ii, year in enumerate(years):
    with h5py.File('/pscratch/sd/s/shas1693/data/era5/train/' + str(year) + '.h5', 'r') as f:
        rnd_idx = np.random.randint(0, 1460-500)
        global_means += np.mean(f['fields'][rnd_idx:rnd_idx+500], keepdims=True, axis=(0,2,3))
        global_stds += np.var(f['fields'][rnd_idx:rnd_idx+500], keepdims=True, axis=(0,2,3))

global_means = global_means/len(years)
global_stds = np.sqrt(global_stds/len(years))
time_means = time_means/len(years)  # <-- never accumulated, so still zero

Following this script, time_means stays constant at zero.
What is the correct definition of this value?

BTW, may I know how you calculate the time_means_daily.h5 file?
From its size (127G) I can only guess it is a (1460, 21, 720, 1440) tensor.
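
For what it's worth, here is a sketch of what the accumulation was presumably meant to look like (my guess, mirroring the global_means pattern above; not an actual fix from the repo). It streams over time steps to avoid loading an entire ~127 GB year into memory:

import h5py
import numpy as np

years = range(1979, 2016)  # hypothetical; use the years list from get_stats.py
time_means = np.zeros((1, 21, 721, 1440))
for year in years:
    with h5py.File('/pscratch/sd/s/shas1693/data/era5/train/' + str(year) + '.h5', 'r') as f:
        ds = f['fields']
        for t in range(ds.shape[0]):
            # running mean over the time axis, one step at a time
            time_means += ds[t] / ds.shape[0]
time_means = time_means / len(years)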

Smaller version of dataset

Hello,
It would be good to have a relatively small dataset for experimentation/learning purposes. By smaller, I mean lower spatial resolution, a shorter time period, fewer variables, etc.
