Comments (10)
Yes, that's correct; just be sure that `nequip-benchmark` actually uses the same dataset config as your final training, and that you use the same directory to store the processed data.

As a rule of thumb, I usually don't need more than 30 GB, particularly for a dataset as small as yours. If it OOMs on the CPU side, I increase to 50 GB or so; the absolute worst case I've ever needed was 500 GB, but that was for a massive dataset.
from nequip.
> Will allocating more cores help in `nequip-benchmark` (is it parallel)?

More cores themselves won't help, but if they come with more memory, that will.
This is correct, but I just want to note a caveat for others reading this issue: the ASE dataset loader/preprocessor (used for extXYZ files, our general recommendation for dataset format) is parallelized over CPU cores and can use as many as you throw at it on a single node (it is not MPI-parallelized). It should autodetect the available cores from SLURM environment variables, but you can also set the count manually with the `NEQUIP_NUM_TASKS` environment variable.

Also note that once preprocessing is complete, the CPU RAM demands of training are generally lower than those of preprocessing.
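For concreteness, a CPU-only preprocessing job could be submitted with something like the following SLURM batch script. This is only a sketch: the resource numbers, partition setup, and config file name are placeholders to adapt to your cluster.

```shell
#!/bin/bash
#SBATCH --job-name=nequip-preprocess
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8      # ASE preprocessing is thread-parallel on one node
#SBATCH --mem=50G              # CPU RAM; 30-50 GB is usually plenty (see above)

# The core count is normally autodetected from SLURM variables,
# but it can also be forced explicitly:
export NEQUIP_NUM_TASKS=8

# Preprocesses (and benchmarks on) the same dataset config used for training:
nequip-benchmark example.yaml
```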
Thanks for the help. In case anyone stumbles on this in the future: instead of using `npz`, taking a cue from the conversation above, I appended all my data into a single xyz file, used the `ase` dataset type, and ran `nequip-benchmark` first. As it is also thread-parallel, it processed the data with the relatively modest resources of 5 cores and 50 GB total RAM (I think it can go much lower, but I didn't test; on my laptop I could process 1000 configs easily with 8 cores and 16 GB RAM). I made the following changes to my `example.yaml` file to map the fields properly:
```yaml
dataset: ase
dataset_file_name: /path/to/Si_all.xyz
dataset_key_mapping:
  forces: forces
  Energy: total_energy  # extxyz file has energy stored under the key Energy
  pbc: PBC
chemical_symbols:
  - Si
dataset_include_frames: !!python/object/apply:builtins.range
  - 0
  - 7400
  - 1
```
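As an aside for readers unfamiliar with that last entry: the `!!python/object/apply:builtins.range` tag tells the YAML loader to construct an ordinary Python `range` from the three listed arguments, so the block above selects frames 0 through 7399:

```python
# What the dataset_include_frames entry above constructs:
# range(start=0, stop=7400, step=1), i.e. every frame from 0 to 7399.
frames = range(0, 7400, 1)

print(len(frames))            # 7400 frames selected
print(frames[0], frames[-1])  # 0 7399
```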
This processed the data easily, and now running `nequip-train` is training the model on the GPU.
Thanks
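As a footnote on the "append everything into a single xyz file" step above: in practice one would typically build `Si_all.xyz` with ASE, e.g. `ase.io.write(path, atoms, append=True)` in a loop over frames. For readers curious what the resulting extxyz records look like, here is a simplified hand-rolled sketch; the helper name `append_extxyz_frame` and the example numbers are made up, and the `Energy` field name mirrors the config above (a real file should be written by ASE, not by hand):

```python
def append_extxyz_frame(path, symbols, positions, energy, cell):
    """Append one frame in (simplified) extxyz format: an atom count line,
    a comment line carrying the lattice and per-frame fields, then one
    line per atom with species and Cartesian position."""
    with open(path, "a") as f:
        f.write(f"{len(symbols)}\n")
        lattice = " ".join(f"{x:.6f}" for row in cell for x in row)
        f.write(f'Lattice="{lattice}" Properties=species:S:1:pos:R:3 '
                f'Energy={energy:.6f} pbc="T T T"\n')
        for s, (x, y, z) in zip(symbols, positions):
            f.write(f"{s} {x:.6f} {y:.6f} {z:.6f}\n")

# Example: append one 2-atom silicon frame (illustrative numbers only).
append_extxyz_frame("Si_all.xyz", ["Si", "Si"],
                    [(0.0, 0.0, 0.0), (1.3575, 1.3575, 1.3575)],
                    -10.84, [[5.43, 0, 0], [0, 5.43, 0], [0, 0, 5.43]])
```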
Oh, and just noting: our PBC key is actually also lowercase `pbc`, so I think you can remove that last mapping; `ase` should handle it anyway, though. Identity mappings like `forces: forces` should also be unnecessary; if you find one to be needed, please let me know, as I'd consider that a bug.
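If both of those defaults hold, the mapping block from the config above could likely shrink to a single entry (an untested simplification of the config posted earlier in this thread):

```yaml
dataset_key_mapping:
  Energy: total_energy  # only the non-default key needs remapping
```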
Hi @ipcamit,

100 GB of RAM is far, far more than you should need on the CPU side; I suspect you are actually running out of GPU memory due to a large cutoff, batch size, or model, or all three. Can you post your actual error along with that information?
The HPC specification says there is 80 GB of memory on the GPU cards.

The cutoff was 4.0 (now submitted again with 3.77 to check); if I remember correctly, the average number of neighbors that nequip computed was about 8.3 on my laptop (on a very small toy set of 4 samples). The batch size was 2, and the model is just a 3-layer version of `example.yaml`, which shows 87096 parameters on my laptop.
The error on the HPC is not that descriptive, I think:

```
Torch device: cuda
Processing dataset...
./run.sh: line 3: 729160 Killed python nequip/nequip/scripts/train.py example.yaml
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=29229954.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.
```
Below is the diff of the things I changed (the bare numbers are the line numbers in `example.yaml`):

```diff
+ num_layers: 3  # number of interaction blocks, we find 3-5 to work best
- num_layers: 4  # number of interaction blocks, we find 3-5 to work best
51,52
+ #dataset_url: http://quantum-machine.org/gdml/data/npz/toluene_ccsd_t.zip  # url to download the npz. optional
+ dataset_file_name: ./Si4.npz
- dataset_url: http://quantum-machine.org/gdml/data/npz/toluene_ccsd_t.zip  # url to download the npz. optional
- dataset_file_name: ./benchmark_data/toluene_ccsd_t-train.npz  # path to data set file
58
+ P: pbc
61
+ - pbc
64
+ - Si
- - H
- - C
67
+ wandb: false  # we recommend using wandb for logging
- wandb: true  # we recommend using wandb for logging
77
+ n_train: 3000  # number of training data
+ n_val: 500  # number of validation data
- n_train: 100  # number of training data
- n_val: 50  # number of validation data
80
+ batch_size: 2
- batch_size: 5
```
Hm, never mind, that's a very reasonable set of parameters.

It looks like it's dying before the model is ever called, during the neighbor-list preprocessing of the dataset. This preprocessing step can be run on a CPU node by calling `nequip-benchmark my-config.yaml`, where presumably you are able to successfully allocate more CPU RAM to the SLURM job?
OK, so if I understand correctly, I should do

```
nequip-benchmark example.yaml
nequip-train example.yaml
```

is that correct? Like I said, is there any rule of thumb for how much RAM to request? Will allocating more cores help in `nequip-benchmark` (is it parallel)?

Lastly, I can run these scripts explicitly, right?

```
python nequip/benchmark.py example.yaml
python nequip/train.py example.yaml
```
Great, glad this resolved your issue @ipcamit, and thank you for documenting it for future users!
Awesome, great to hear.