nvlabs / fourcastnet
Initial public release of code, data, and model weights for FourCastNet
License: Other
Thank you for releasing this amazing repo!
I found the reason for the error: it has to do with the number of in_channels specified in the AFNO.yaml file.
I changed in_channels to [0, 1, 2] and I don't get the error now.
I'm closing this issue, but it'd be great if someone could give me some insight or add a short note on the changes needed to train the model on smaller h5 files.
Thank you again for releasing the repo. I look forward to understanding the code better :)
==================
I'm able to run train.py with the large h5 files available on Globus.
When I try to run train.py with the smaller h5 files (regional or era5_subsample) made available on the NERSC portal, the following line throws an error:
https://github.com/NVlabs/FourCastNet/blob/master/utils/data_loader_multifiles.py#L207
Specifically,
self.files[year_idx][(local_idx-self.dt*self.n_history):(local_idx+1):self.dt, self.in_channels]
throws the following error:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "/home/user/anaconda3/envs/climax/lib/python3.8/site-packages/h5py/_hl/dataset.py", line 710, in __getitem__
return self._fast_reader.read(args)
File "h5py/_selector.pyx", line 351, in h5py._selector.Reader.read
File "h5py/_selector.pyx", line 198, in h5py._selector.Selector.apply_args
IndexError: Fancy indexing out of range for (0-2)
The shape of self.files[year_idx] for the larger h5 files on Globus is
HDF5 dataset: shape (1460, 21, 721, 1440)
The shape of self.files[year_idx] for the smaller h5 files on NERSC (regional or era5_subsample) is
HDF5 dataset: shape (1460, 3, 360, 360)
I'm not very familiar with h5py yet. Could someone please help me edit the code at https://github.com/NVlabs/FourCastNet/blob/master/utils/data_loader_multifiles.py#L207 so I can run train.py on the smaller h5 files?
Is there some edit to AFNO.yaml, besides the file paths, that needs to be made to train the model on the smaller h5 files?
Thank you! @jdppthk @MortezaMardani
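The two shapes reported above differ on the channel axis (21 versus 3), so any index in in_channels above 2 is out of range for the smaller files, which matches the "out of range for (0-2)" message. A minimal sketch of a pre-flight check follows. The helper name check_channels and the range(20) channel list are assumptions made for illustration, not part of the repository.

```python
def check_channels(in_channels, dataset_shape):
    """Return in_channels as a list, or raise a readable error if any index
    does not fit the channel axis of a (time, channel, lat, lon) dataset."""
    n_channels = dataset_shape[1]  # axis 1 is the channel axis
    bad = [c for c in in_channels if not 0 <= c < n_channels]
    if bad:
        raise ValueError(
            f"in_channels {bad} out of range for a file with "
            f"{n_channels} channels (valid: 0-{n_channels - 1})"
        )
    return list(in_channels)


full_shape = (1460, 21, 721, 1440)   # shape reported for the Globus files
small_shape = (1460, 3, 360, 360)    # shape reported for the NERSC files
assumed_channels = list(range(20))   # hypothetical stand-in for the config value

check_channels(assumed_channels, full_shape)   # fits the 21-channel files
check_channels([0, 1, 2], small_shape)         # fits the 3-channel files

try:
    check_channels(assumed_channels, small_shape)
except ValueError as err:
    print(err)
```

Calling such a check once, with the shape of the opened dataset, would surface the mismatch before the h5py indexing call does.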
Thank you for your great code; this is a SOTA model.
I had an issue running the pre-processing script parallel_copy.py, or MPI.py (similar to parallel_copy.py, but with a different number of years), on the full-year datasets (2016-2021). I still get the error: ValueError: h5py was built without MPI support, can't use mpio driver
I installed OpenMPI and mpi4py:
(cast) mg@amru-System-Product-Name:~$ mpiexec -n 5 python -m mpi4py.bench helloworld
Hello, World! I am process 0 of 5 on amru-System-Product-Name.
Hello, World! I am process 1 of 5 on amru-System-Product-Name.
Hello, World! I am process 2 of 5 on amru-System-Product-Name.
Hello, World! I am process 3 of 5 on amru-System-Product-Name.
Hello, World! I am process 4 of 5 on amru-System-Product-Name.
I don't know what causes this problem because, from my point of view, everything should be fine with the code and datasets.
(cast) mg@amru-System-Product-Name:~/Documents/Data$ mpirun -n 4 python MPI.py
{2016: 'j', 2017: 'j', 2018: 'k', 2019: 'k', 2020: 'a', 2021: 'a'}
2016
{2016: 'j', 2017: 'j', 2018: 'k', 2019: 'k', 2020: 'a', 2021: 'a'}
2016
==============================
rank 1
Nproc 4
==============================
Nimgtot 1460
Nproc 4
Nimg 365
Traceback (most recent call last):
File "MPI.py", line 130, in <module>
with h5py.File(f'{str(year)}.h5', 'w') as f:
File "/home/mg/.local/lib/python3.8/site-packages/h5py/_hl/files.py", line 442, in __init__
fid = make_fid(name, mode, userblock_size,
File "/home/mg/.local/lib/python3.8/site-packages/h5py/_hl/files.py", line 201, in make_fid
fid = h5f.create(name, h5f.ACC_TRUNC, fapl=fapl, fcpl=fcpl)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py/h5f.pyx", line 116, in h5py.h5f.create
BlockingIOError: [Errno 11] Unable to create file (unable to lock file, errno = 11, error message = 'Resource temporarily unavailable')
{2016: 'j', 2017: 'j', 2018: 'k', 2019: 'k', 2020: 'a', 2021: 'a'}
2016
{2016: 'j', 2017: 'j', 2018: 'k', 2019: 'k', 2020: 'a', 2021: 'a'}
2016
Traceback (most recent call last):
File "MPI.py", line 133, in <module>
writetofile(src, dest, 0, ['u10'])
File "MPI.py", line 75, in writetofile
fdest = h5py.File(dest, 'a', driver='mpio', comm=MPI.COMM_WORLD)
File "/home/mg/.local/lib/python3.8/site-packages/h5py/_hl/files.py", line 441, in __init__
fapl = make_fapl(driver, libver, rdcc_nslots, rdcc_nbytes, rdcc_w0, **kwds)
File "/home/mg/.local/lib/python3.8/site-packages/h5py/_hl/files.py", line 144, in make_fapl
set_fapl(plist, **kwds)
File "/home/mg/.local/lib/python3.8/site-packages/h5py/_hl/files.py", line 48, in _set_fapl_mpio
raise ValueError("h5py was built without MPI support, can't use mpio driver")
ValueError: h5py was built without MPI support, can't use mpio driver
Traceback (most recent call last):
File "MPI.py", line 130, in <module>
with h5py.File(f'{str(year)}.h5', 'w') as f:
File "/home/mg/.local/lib/python3.8/site-packages/h5py/_hl/files.py", line 442, in __init__
fid = make_fid(name, mode, userblock_size,
File "/home/mg/.local/lib/python3.8/site-packages/h5py/_hl/files.py", line 201, in make_fid
fid = h5f.create(name, h5f.ACC_TRUNC, fapl=fapl, fcpl=fcpl)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py/h5f.pyx", line 116, in h5py.h5f.create
BlockingIOError: [Errno 11] Unable to create file (unable to lock file, errno = 11, error message = 'Resource temporarily unavailable')
==============================
rank 2
Nproc 4
==============================
Nimgtot 1460
Nproc 4
Nimg 365
Traceback (most recent call last):
File "MPI.py", line 133, in <module>
writetofile(src, dest, 0, ['u10'])
File "MPI.py", line 75, in writetofile
fdest = h5py.File(dest, 'a', driver='mpio', comm=MPI.COMM_WORLD)
File "/home/mg/.local/lib/python3.8/site-packages/h5py/_hl/files.py", line 441, in __init__
fapl = make_fapl(driver, libver, rdcc_nslots, rdcc_nbytes, rdcc_w0, **kwds)
File "/home/mg/.local/lib/python3.8/site-packages/h5py/_hl/files.py", line 144, in make_fapl
set_fapl(plist, **kwds)
File "/home/mg/.local/lib/python3.8/site-packages/h5py/_hl/files.py", line 48, in _set_fapl_mpio
raise ValueError("h5py was built without MPI support, can't use mpio driver")
ValueError: h5py was built without MPI support, can't use mpio driver
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[3210,1],0]
Exit code: 1
The datasets were downloaded from cds.climate.copernicus.eu for each year, with 20 parameters.
I just started using mpi4py and h5py. Could you please help me run the preprocessing script parallel_copy.py?
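The mpi4py helloworld run shown earlier only confirms that mpi4py works; the ValueError comes from the installed h5py build itself. A small sketch of how to check that build is given below, using h5py.get_config().mpi. The helper name h5py_mpi_status is hypothetical, and the import is guarded so the snippet also runs where h5py is absent.

```python
def h5py_mpi_status():
    """Return True or False for MPI support in the installed h5py,
    or None if h5py cannot be imported at all."""
    try:
        import h5py
    except ImportError:
        return None
    return bool(h5py.get_config().mpi)


status = h5py_mpi_status()
if status is None:
    print("h5py is not installed in this environment")
elif status:
    print("h5py was built with MPI support; driver='mpio' should be usable")
else:
    print("h5py was built without MPI support; driver='mpio' will raise ValueError")
```

If this reports no MPI support, the h5py documentation describes rebuilding against a parallel HDF5 library, for example with HDF5_MPI="ON" CC=mpicc pip install --no-binary=h5py h5py. A wheel installed with plain pip is typically built without MPI, which may be the case here.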
First of all, thank you for your GREAT code; it is a real game changer.
I had an issue running the pre-processing stage. Even with the exact datasets used in the example code for this stage (13 days), I still get the error: KeyError "Unable to open object (object 'fields' doesn't exist)"
I don't know what causes this problem because, from my point of view, everything should be fine with the code and datasets.
I do not have an organization, so I log into Globus with a personal identity.
I ran into a network issue with the personal endpoint. Is there another place, a Google cloud drive for example, to download the pretrained model file?
Hello, thanks for open-sourcing such well-structured code!
I am currently having trouble downloading the pre-processed data shared on Globus.
Even after logging in, the download button seems to be disabled. Has anything changed with the permissions? What can I do to fix this? Below is the screenshot of what I am facing.
When I try to run the .grib file, the output image is displayed along with an error:
cfgrib.dataset.DatasetBuildError: key present and new value is different: key='surface' value=Variable(dimensions=(), data=0.0) new_value=Variable(dimensions=(), data=2.0)
How can I solve the problem that the predicted result differs greatly from the actual value?
Regarding the pre-trained weight file backbone.ckpt, is this the weight file of the afno_backbone stage or the weight file of the afno_backbone_finetune stage? I will be extremely grateful for your prompt reply.
I have a very general question about this method that is not clear to me. Sorry if it is very basic; I am a newbie in this field.
I ran the code and got results for the same dataset used in the paper (ERA5, 30 km resolution) for different dates.
Can I use higher-resolution data (2 km) for a smaller area too? Do I need to train the whole model from scratch, or can I use the checkpoints provided with the paper to run the code on higher-resolution datasets?
Thank you very much for your time! :)
Thanks for your amazing work! I want to reproduce your results at a lower resolution, but I get really bad results at 64*32. How should I change the settings?
Hello! I can't understand the calculation method in FFT.
o1_real = x.real * w1[0] - x.imag * w1[1] + b1[0]
o1_imag = x.imag * w1[0] + x.real * w1[1] + b1[1]
o2_real = o1_real * w2[0] - o1_imag * w2[1] + b2[0]
o2_imag = o1_imag * w2[0] + o1_real * w2[1] + b2[1]
Why are the signs '-', '+', '+' for the real part and '+', '+', '+' for the imaginary part? What role does this combination play? Can I change the first '-' to '+', or vice versa?
And what role does 'kept_modes' play? Is it like dropout?
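For reference, the sign pattern in the o1 lines is the one that complex multiplication produces: (a + bi)(c + di) = (ac - bd) + (ad + bc)i, with a complex bias added afterwards. The following sketch checks this with scalar values, reading w[0] and b[0] as real parts and w[1] and b[1] as imaginary parts. The function name complex_affine and the sample numbers are made up for illustration.

```python
def complex_affine(x, w, b):
    """Compute x * w + b for complex numbers given as (real, imag) pairs,
    using the same real/imag split as the quoted o1 lines."""
    x_re, x_im = x
    real = x_re * w[0] - x_im * w[1] + b[0]
    imag = x_im * w[0] + x_re * w[1] + b[1]
    return real, imag


x = (1.5, -2.0)   # sample input, real and imaginary part
w = (0.5, 3.0)    # sample weight
b = (0.25, -1.0)  # sample bias

real, imag = complex_affine(x, w, b)

# Reference result from Python's built-in complex arithmetic.
expected = complex(*x) * complex(*w) + complex(*b)
assert abs(real - expected.real) < 1e-12
assert abs(imag - expected.imag) < 1e-12

# Flipping the first '-' to '+' gives a different number, so the
# operation would no longer be a complex multiplication.
flipped_real = x[0] * w[0] + x[1] * w[1] + b[0]
assert abs(flipped_real - expected.real) > 1e-6
print(real, imag)
```

Under this reading, the minus sign comes from i * i = -1, which is why changing it alters what the layer computes rather than just its notation.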
Hello everyone. I have an issue with the TP ACC, which is extremely low. Does anyone know what the problem could be? Below are the output of an inference run and my input data info:
input h5 shape: tp(120, 721, 1440).
inference:
2023-06-25 11:24:06,620 - root - INFO - Timestep 0 of 20. TP RMS Error: 0.0, ACC: 1.0
2023-06-25 11:24:09,065 - root - INFO - Timestep 1 of 20. TP RMS Error: 0.0016368781216442585, ACC: 0.35278499126434326
2023-06-25 11:24:09,388 - root - INFO - Timestep 2 of 20. TP RMS Error: 0.00156010827049613, ACC: 0.3554094731807709
2023-06-25 11:24:09,688 - root - INFO - Timestep 3 of 20. TP RMS Error: 0.0015420995187014341, ACC: 0.3688367009162903
2023-06-25 11:24:09,988 - root - INFO - Timestep 4 of 20. TP RMS Error: 0.0014832873130217195, ACC: 0.3889610171318054
2023-06-25 11:24:10,289 - root - INFO - Timestep 5 of 20. TP RMS Error: 0.0014608813216909766, ACC: 0.36930468678474426
2023-06-25 11:24:10,589 - root - INFO - Timestep 6 of 20. TP RMS Error: 0.0015008833725005388, ACC: 0.32038554549217224
2023-06-25 11:24:10,889 - root - INFO - Timestep 7 of 20. TP RMS Error: 0.0013966681435704231, ACC: 0.36792299151420593
2023-06-25 11:24:11,189 - root - INFO - Timestep 8 of 20. TP RMS Error: 0.001374703599140048, ACC: 0.3730277121067047
2023-06-25 11:24:11,490 - root - INFO - Timestep 9 of 20. TP RMS Error: 0.0013704199809581041, ACC: 0.33114054799079895
2023-06-25 11:24:11,790 - root - INFO - Timestep 10 of 20. TP RMS Error: 0.0014247711515054107, ACC: 0.2770615518093109
2023-06-25 11:24:12,090 - root - INFO - Timestep 11 of 20. TP RMS Error: 0.001329558901488781, ACC: 0.3315066695213318
2023-06-25 11:24:12,390 - root - INFO - Timestep 12 of 20. TP RMS Error: 0.0012841359712183475, ACC: 0.35082289576530457
2023-06-25 11:24:12,690 - root - INFO - Timestep 13 of 20. TP RMS Error: 0.0012776957591995597, ACC: 0.32422271370887756
2023-06-25 11:24:12,990 - root - INFO - Timestep 14 of 20. TP RMS Error: 0.001365577569231391, ACC: 0.2353413850069046
2023-06-25 11:24:13,290 - root - INFO - Timestep 15 of 20. TP RMS Error: 0.001326375175267458, ACC: 0.28706997632980347
2023-06-25 11:24:13,590 - root - INFO - Timestep 16 of 20. TP RMS Error: 0.0013120684307068586, ACC: 0.3122824430465698
2023-06-25 11:24:13,890 - root - INFO - Timestep 17 of 20. TP RMS Error: 0.001352619961835444, ACC: 0.270219087600708
2023-06-25 11:24:14,190 - root - INFO - Timestep 18 of 20. TP RMS Error: 0.0014553270302712917, ACC: 0.1826096624135971
2023-06-25 11:24:14,489 - root - INFO - Timestep 19 of 20. TP RMS Error: 0.0014066090807318687, ACC: 0.23088878393173218
The training section refers to National Energy Resarch Scientific Computing Center (NERSC).
I believe it should be "Research".
Hi,
I was looking to perform inference on the trained model and noticed that the download option was not available in the Globus App. Could someone please look into this.
Thanks,
Vignesh
Please see: https://github.com/NVlabs/FourCastNet/blob/master/data_process/get_stats.py
**time_means = np.zeros((1,21,721, 1440))**
for ii, year in enumerate(years):
    with h5py.File('/pscratch/sd/s/shas1693/data/era5/train/'+ str(year) + '.h5', 'r') as f:
        rnd_idx = np.random.randint(0, 1460-500)
        global_means += np.mean(f['fields'][rnd_idx:rnd_idx+500], keepdims=True, axis = (0,2,3))
        global_stds += np.var(f['fields'][rnd_idx:rnd_idx+500], keepdims=True, axis = (0,2,3))
global_means = global_means/len(years)
global_stds = np.sqrt(global_stds/len(years))
**time_means = time_means/len(years)**
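In the quoted snippet, time_means is initialized and divided by len(years) but nothing is added to it inside the loop. One possible way to accumulate it is sketched below, assuming time_means is meant to be the per-grid-point mean over time. The small synthetic arrays stand in for f['fields'] and are not real data, and this is not necessarily how the authors computed their statistics.

```python
import numpy as np

rng = np.random.default_rng(0)
years = [2016, 2017]

# Synthetic stand-ins for f['fields'], shaped (time, channel, lat, lon).
fake_fields = {year: rng.normal(size=(8, 3, 4, 5)) for year in years}

time_means = np.zeros((1, 3, 4, 5))
for year in years:
    fields = fake_fields[year]
    # Average over the time axis only, keeping channel, lat and lon.
    time_means += np.mean(fields, keepdims=True, axis=0)

# Average the per-year means, mirroring the final line of the snippet.
time_means = time_means / len(years)
print(time_means.shape)
```

With equally long years, the average of the per-year means equals the mean over all time steps; with unequal lengths it would only approximate it.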
BTW, may I know how you calculate the time_means_daily.h5 file?
From its size (127G) I can only guess it is a
Hello,
It would be good to have a relatively small dataset for experimentation and learning purposes. "Smaller" could mean lower spatial resolution, a shorter time period, fewer variables, etc.