Comments (10)

simonbatzner commented on May 20, 2024

Yes, that's correct; just be sure that nequip-benchmark actually uses the same dataset config as your final training, and that you store the processed data in the same directory.

Rule of thumb: I usually don't need more than 30GB, in particular for a dataset as small as yours. If it OOMs on the CPU front, I increase to 50GB or so; the absolute worst case I've ever used is 500GB, but that was for a massive dataset. More cores themselves won't help, but if they come with more memory, that will help.
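
For concreteness, a minimal SLURM sketch of such a memory request (the partition name and exact numbers are assumptions; adjust them for your cluster):

#!/bin/bash
#SBATCH --partition=cpu        # hypothetical CPU partition name
#SBATCH --cpus-per-task=5      # the ASE loader parallelizes over these cores
#SBATCH --mem=50G              # per the rule of thumb above; raise if it OOMs

nequip-benchmark example.yaml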

Linux-cpp-lisp commented on May 20, 2024

> Will allocating more cores help in nequip-benchmark (is it parallel)?
>
> More cores themselves won't help, but if they come with more memory, that will help.

This is correct, but I just want to note a caveat for others reading this issue: the ASE dataset loader/preprocessor (as used for extXYZ files, our general recommendation for dataset format) is parallelized over CPU cores and can use as many as you throw at it on a single node (it is not MPI parallelized). It should autodetect the available cores from SLURM environment variables, but you can also set the count manually with the NEQUIP_NUM_TASKS environment variable.
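
For example (a sketch; the worker count here is an arbitrary illustration):

export NEQUIP_NUM_TASKS=5   # manually override the core count autodetected from SLURM
nequip-benchmark example.yaml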

Also note that once preprocessing is complete, the CPU RAM demands of training are generally lower than those of preprocessing.

ipcamit commented on May 20, 2024

Thanks for the help. In case anyone stumbles on this in the future: instead of using npz, taking a cue from the conversation above, I appended all my data into a single xyz file, used the ase dataset type, and ran nequip-benchmark first. As it is also thread parallel, it processed the data with relatively modest resources: 5 cores and 50 GB of total RAM (I think it can go much lower, but I didn't test; on my laptop I could process 1000 configs easily with 8 cores and 16 GB of RAM). I made the following changes to my example.yaml file to map the fields properly.

dataset: ase
dataset_file_name: /path/to/Si_all.xyz
dataset_key_mapping:
  forces: forces
  Energy: total_energy # extxyz file has energy stored under the key Energy
  pbc: PBC

chemical_symbols:
  - Si

dataset_include_frames: !!python/object/apply:builtins.range
  - 0      # start frame
  - 7400   # stop frame (exclusive)
  - 1      # stride
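
(For reference, a minimal sketch of the appending step using standard ase I/O; the per-run input file names here are hypothetical:)

from ase.io import read, write

frames = []
for fname in ["run1.xyz", "run2.xyz"]:     # hypothetical per-run files
    frames.extend(read(fname, index=":"))  # read every frame from each file

write("Si_all.xyz", frames, format="extxyz")  # one combined extXYZ file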

This processed the data easily, and nequip-train is now training the model on the GPU.

Thanks

Linux-cpp-lisp commented on May 20, 2024

Oh, and just noting: our PBC key is actually also lowercase pbc, so I think you can remove the last mapping (ase should handle it anyway, though). And identity mappings like forces: forces should also be unnecessary; if you find one to be needed, please let me know, as I'd consider that a bug.
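
In other words, the mapping block above could likely shrink to the single non-identity key (a sketch based on this comment):

dataset_key_mapping:
  Energy: total_energy # the only key whose name differs from nequip's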

Linux-cpp-lisp commented on May 20, 2024

Hi @ipcamit ,

100GB of RAM is far, far more than you should need on the CPU side; I suspect you are actually running out of GPU memory due to a large cutoff, batch size, or model (or all three). Can you post your actual error and that information?

ipcamit commented on May 20, 2024

The HPC specification says there is 80 GB of memory on the GPU cards.
The cutoff was 4.0 (now submitted again with 3.77 to check); if I remember correctly, the average number of neighbors that nequip computed was about 8.3 on my laptop (on a very small toy set of 4 samples). The batch size was 2, and the model is just a 3-layer version of example.yaml, which shows 87096 parameters on my laptop.
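
(In config terms, assuming the standard example.yaml keys, that setup corresponds roughly to:)

r_max: 4.0       # cutoff radius; resubmitted with 3.77 to check
num_layers: 3    # 3-layer version of example.yaml, ~87k parameters
batch_size: 2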

The error on the HPC is not that descriptive, I think:

Torch device: cuda
Processing dataset...
./run.sh: line 3: 729160 Killed                  python nequip/nequip/scripts/train.py example.yaml
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=29229954.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

Below is a diff of the changes I made to example.yaml (original lines prefixed with -, my changes with +).

- num_layers: 4                                                               # number of interaction blocks, we find 3-5 to work best
+ num_layers: 3

- dataset_url: http://quantum-machine.org/gdml/data/npz/toluene_ccsd_t.zip    # url to download the npz. optional
- dataset_file_name: ./benchmark_data/toluene_ccsd_t-train.npz                # path to data set file
+ #dataset_url: http://quantum-machine.org/gdml/data/npz/toluene_ccsd_t.zip   # url to download the npz. optional
+ dataset_file_name: ./Si4.npz

+   P: pbc

+   - pbc

-   - H
-   - C
+   - Si

- wandb: true                                                                 # we recommend using wandb for logging
+ wandb: false

- n_train: 100                                                                # number of training data
- n_val: 50                                                                   # number of validation data
+ n_train: 3000
+ n_val: 500

- batch_size: 5
+ batch_size: 2

Linux-cpp-lisp commented on May 20, 2024

Hm, never mind; that's a very reasonable set of parameters.

It looks like it's dying before the model is ever called, during the neighbor-list preprocessing of the dataset. This preprocessing step can be run on a CPU node by calling nequip-benchmark my-config.yaml; presumably you are able to successfully allocate more CPU RAM to the SLURM job there?
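
For instance, a sketch of splitting the two steps into separate SLURM jobs (partition names and resource numbers are illustrative assumptions; both jobs must use the same config and processed-data directory):

# preprocess on a CPU node, then train on a GPU node once that succeeds
PREP=$(sbatch --parsable --partition=cpu --cpus-per-task=5 --mem=50G \
       --wrap="nequip-benchmark example.yaml")
sbatch --dependency=afterok:$PREP --partition=gpu --gres=gpu:1 \
       --wrap="nequip-train example.yaml"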

ipcamit commented on May 20, 2024

OK, so if I understand correctly, I should run

nequip-benchmark example.yaml
nequip-train example.yaml

Is that correct?
Like I said, is there any rule of thumb for how much RAM to request?
Will allocating more cores help in nequip-benchmark (is it parallel)?
Lastly, I can run these scripts explicitly, right?

python nequip/benchmark.py example.yaml
python nequip/train.py example.yaml

Linux-cpp-lisp commented on May 20, 2024

Great, glad this resolved your issue, @ipcamit, and thank you for documenting it for future users!

simonbatzner commented on May 20, 2024

Awesome, great to hear.
