Comments (10)

simonbatzner commented on May 20, 2024

Yes, that's correct; just be sure that nequip-benchmark actually uses the same dataset config as your final training, and that you store the processed data in the same directory.

Rule of thumb: I usually don't need more than 30GB, in particular for a dataset as small as yours. If it OOMs on the CPU front, I increase to 50GB or so; the absolute worst case I've ever used is 500GB, but that was for a massive dataset. More cores themselves won't help, but if they come with more memory, that will help.
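
For concreteness, a minimal SLURM sketch of such a memory request (the partition name and exact numbers are assumptions; adjust them for your cluster):

#!/bin/bash
#SBATCH --partition=cpu        # hypothetical CPU partition name
#SBATCH --cpus-per-task=5      # the ASE loader parallelizes over these cores
#SBATCH --mem=50G              # per the rule of thumb above; raise if it OOMs

nequip-benchmark example.yaml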

Linux-cpp-lisp commented on May 20, 2024

> Will allocating more cores help in nequip-benchmark (is it parallel)?
>
> More cores themselves won't help, but if they come with more memory, that will help.

This is correct, but I just want to note a caveat for others reading this issue: the ASE dataset loader/preprocessor (as used for extXYZ files, our general recommendation for dataset format) is parallelized over CPU cores and can use as many as you throw at it on a single node (it is not MPI parallelized). It should autodetect the available cores from SLURM environment variables, but you can also set the count manually with the NEQUIP_NUM_TASKS environment variable.
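
For example (a sketch; the worker count here is an arbitrary illustration):

export NEQUIP_NUM_TASKS=5   # manually override the core count autodetected from SLURM
nequip-benchmark example.yaml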

Also note that once preprocessing is complete, the CPU RAM demands of training are generally lower than those of preprocessing.

ipcamit commented on May 20, 2024

Thanks for the help. In case anyone stumbles on this in the future: instead of using npz, taking a cue from the conversation above, I appended all my data into a single xyz file, used the ase dataset type, and ran nequip-benchmark first. As it is also thread parallel, it processed the data with relatively modest resources: 5 cores and 50 GB of total RAM (I think it can go much lower, but I didn't test; on my laptop I could process 1000 configs easily with 8 cores and 16 GB of RAM). I made the following changes to my example.yaml file to map the fields properly.

dataset: ase
dataset_file_name: /path/to/Si_all.xyz
dataset_key_mapping:
  forces: forces
  Energy: total_energy # extxyz file has energy stored under the key Energy
  pbc: PBC

chemical_symbols:
  - Si

dataset_include_frames: !!python/object/apply:builtins.range
  - 0      # start frame
  - 7400   # stop frame (exclusive)
  - 1      # stride
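
(For reference, a minimal sketch of the appending step using standard ase I/O; the per-run input file names here are hypothetical:)

from ase.io import read, write

frames = []
for fname in ["run1.xyz", "run2.xyz"]:     # hypothetical per-run files
    frames.extend(read(fname, index=":"))  # read every frame from each file

write("Si_all.xyz", frames, format="extxyz")  # one combined extXYZ file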

This processed the data easily, and nequip-train is now training the model on the GPU.

Thanks

Linux-cpp-lisp commented on May 20, 2024

Oh, and just noting: our PBC key is actually also lowercase pbc, so I think you can remove the last mapping (ase should handle it anyway, though). And identity mappings like forces: forces should also be unnecessary; if you find one to be needed, please let me know, as I'd consider that a bug.
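
In other words, the mapping block above could likely shrink to the single non-identity key (a sketch based on this comment):

dataset_key_mapping:
  Energy: total_energy # the only key whose name differs from nequip's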

Linux-cpp-lisp commented on May 20, 2024

Hi @ipcamit ,

100GB of RAM is far, far more than you should need on the CPU side; I suspect you are actually running out of GPU memory due to a large cutoff, batch size, or model (or all three). Can you post your actual error and that information?

ipcamit commented on May 20, 2024

The HPC specification says there is 80 GB of memory on the GPU cards.
The cutoff was 4.0 (now submitted again with 3.77 to check); if I remember correctly, the average number of neighbors that nequip computed was about 8.3 on my laptop (on a very small toy set of 4 samples). The batch size was 2, and the model is just a 3-layer version of example.yaml, which shows 87096 parameters on my laptop.
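
(In config terms, assuming the standard example.yaml keys, that setup corresponds roughly to:)

r_max: 4.0       # cutoff radius; resubmitted with 3.77 to check
num_layers: 3    # 3-layer version of example.yaml, ~87k parameters
batch_size: 2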

The error on the HPC is not that descriptive, I think:

Torch device: cuda
Processing dataset...
./run.sh: line 3: 729160 Killed                  python nequip/nequip/scripts/train.py example.yaml
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=29229954.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

Below is a diff of the changes I made to example.yaml (original lines prefixed with -, my changes with +).

- num_layers: 4                                                               # number of interaction blocks, we find 3-5 to work best
+ num_layers: 3

- dataset_url: http://quantum-machine.org/gdml/data/npz/toluene_ccsd_t.zip    # url to download the npz. optional
- dataset_file_name: ./benchmark_data/toluene_ccsd_t-train.npz                # path to data set file
+ #dataset_url: http://quantum-machine.org/gdml/data/npz/toluene_ccsd_t.zip   # url to download the npz. optional
+ dataset_file_name: ./Si4.npz

+   P: pbc

+   - pbc

-   - H
-   - C
+   - Si

- wandb: true                                                                 # we recommend using wandb for logging
+ wandb: false

- n_train: 100                                                                # number of training data
- n_val: 50                                                                   # number of validation data
+ n_train: 3000
+ n_val: 500

- batch_size: 5
+ batch_size: 2

Linux-cpp-lisp commented on May 20, 2024

Hm, never mind; that's a very reasonable set of parameters.

It looks like it's dying before the model is ever called, during the neighbor-list preprocessing of the dataset. This preprocessing step can be run on a CPU node by calling nequip-benchmark my-config.yaml; presumably you are able to successfully allocate more CPU RAM to the SLURM job there?
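
For instance, a sketch of splitting the two steps into separate SLURM jobs (partition names and resource numbers are illustrative assumptions; both jobs must use the same config and processed-data directory):

# preprocess on a CPU node, then train on a GPU node once that succeeds
PREP=$(sbatch --parsable --partition=cpu --cpus-per-task=5 --mem=50G \
       --wrap="nequip-benchmark example.yaml")
sbatch --dependency=afterok:$PREP --partition=gpu --gres=gpu:1 \
       --wrap="nequip-train example.yaml"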

ipcamit commented on May 20, 2024

OK, so if I understand correctly, I should run

nequip-benchmark example.yaml
nequip-train example.yaml

Is that correct?
Like I said, is there any rule of thumb for how much RAM to request?
Will allocating more cores help in nequip-benchmark (is it parallel)?
Lastly, I can run these scripts explicitly, right?

python nequip/benchmark.py example.yaml
python nequip/train.py example.yaml

Linux-cpp-lisp commented on May 20, 2024

Great, glad this resolved your issue, @ipcamit, and thank you for documenting it for future users!

simonbatzner commented on May 20, 2024

Awesome, great to hear.
