mir-group / allegro

Allegro is an open-source code for building highly scalable and accurate equivariant deep learning interatomic potentials

Home Page: https://www.nature.com/articles/s41467-023-36329-y

License: MIT License

Python 100.00%
atomistic-simulations computational-chemistry deep-learning drug-discovery force-fields interatomic-potentials machine-learning materials-science molecular-dynamics pytorch

allegro's Introduction

Allegro

This package implements the Allegro E(3)-equivariant machine-learning interatomic potential (https://arxiv.org/abs/2204.05249).

Allegro logo

In particular, allegro implements the Allegro model as an extension package to the NequIP package.

Installation

Please note that this package CANNOT be installed from PyPI as pip install allegro.

allegro requires the nequip package and its dependencies; please see the NequIP installation instructions for details.

Once nequip is installed, you can install allegro from source by running:

git clone --depth 1 https://github.com/mir-group/allegro.git
cd allegro
pip install .
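A quick way to check the installation is to import both packages (an illustrative sanity check; allegro.model.Allegro is the same builder referenced in the configs below):

import nequip
from allegro.model import Allegro  # the builder referenced by `model_builders` below

print("nequip and allegro imported successfully")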

Tutorial

The best way to learn how to use Allegro is through the Colab Tutorial. This runs entirely on Google's cloud virtual machines; you do not need to install or run anything locally.

Usage

Allegro models are trained, evaluated, deployed, etc. identically to NequIP models using the nequip-* commands. See the NequIP README for details.

The key difference between using an Allegro model and a NequIP model is in the options used to define the model. We provide two Allegro config files analogous to those in nequip: `configs/minimal.yaml` and `configs/example.yaml`.

The key option that tells nequip to build an Allegro model is the model_builders option, which we set to:

model_builders:
 - allegro.model.Allegro
 # the typical model builders from `nequip` are still used to wrap the core Allegro energy model:
 - PerSpeciesRescale
 - ForceOutput
 - RescaleEnergyEtc
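The allegro.model.Allegro entry is a dotted import path; roughly speaking, nequip resolves it to a Python callable along the lines of this sketch (illustrative only, not nequip's internal code):

import importlib

# "allegro.model.Allegro" -> import the module, then fetch the attribute;
# built-in builders such as ForceOutput are referenced by bare name instead.
module_path, builder_name = "allegro.model.Allegro".rsplit(".", 1)
builder = getattr(importlib.import_module(module_path), builder_name)
print(builder)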

LAMMPS Integration

We offer a LAMMPS plugin, pair_allegro, for using Allegro models in LAMMPS simulations, including support for Kokkos acceleration and MPI-parallel simulations. Please see the pair_allegro repository for more details.

References and citing

The Allegro model and the theory behind it are described in our pre-print:

Learning Local Equivariant Representations for Large-Scale Atomistic Dynamics
Albert Musaelian, Simon Batzner, Anders Johansson, Lixin Sun, Cameron J. Owen, Mordechai Kornbluth, Boris Kozinsky
https://arxiv.org/abs/2204.05249
https://doi.org/10.48550/arXiv.2204.05249

The implementation of Allegro is built on NequIP [1], our framework for E(3)-equivariant interatomic potentials, and e3nn [2], a general framework for building E(3)-equivariant neural networks. If you use this repository in your work, please consider citing the NequIP code [1] and e3nn [3] as well:

  1. https://github.com/mir-group/nequip
  2. https://e3nn.org
  3. https://doi.org/10.5281/zenodo.3724963

Contact, questions, and contributing

If you have questions, please don't hesitate to reach out to batzner[at]g[dot]harvard[dot]edu and albym[at]seas[dot]harvard[dot]edu.

If you find a bug or have a proposal for a feature, please post it in the Issues. If you have a question, topic, or issue that isn't obviously one of those, try our GitHub Discussions.

If your post is related to the general NequIP framework/package, please post in the issues/discussion on that repository. Discussions on this repository should be specific to the allegro package and Allegro model.

If you want to contribute to the code, please read CONTRIBUTING.md from the nequip repository; this repository follows all the same processes.

allegro's People

Contributors

linux-cpp-lisp, simonbatzner


allegro's Issues

Training gets slower with increasing epochs

almgcu.yaml.txt

When I trained the model on my own data (input config posted above, modified based on example.yaml), I found that as the number of epochs increases, the training process becomes slower and slower. The results in metrics_epoch.csv are as follows:

[screenshot: metrics_epoch.csv results showing per-epoch wall times]

Here, we can see that the first epoch took only ~1h, while the 8th epoch took ~8h. If I interrupt the training process and restart it, the first epoch after restarting takes about 1h again, but it gets slower and slower afterward. I'm not sure if I'm doing something wrong that's causing this strange behavior.

Model training for StressForceOutput

Hi,

Thanks for making the Allegro repo public. I was wondering whether you have any guidance or thoughts on preparing a config file for training an Allegro model that predicts stress tensor outputs in addition to forces and total potential energies (StressForceOutput).

I've tried several attempts on my dataset with the configurations below, but the stress loss decreases only very slowly and marginally, while the losses for forces and energies keep decreasing after a certain number of training epochs.

Attempt 1: Applying PerAtomMSELoss
loss_coeffs:
  forces: 1.
  total_energy:
    - 1.
    - PerAtomMSELoss
  stress:
    - 1.
    - PerAtomMSELoss

Attempt 2: Assigning more weights to loss for stress tensor predictions
loss_coeffs:
  forces: 1.
  total_energy:
    - 1.
    - PerAtomMSELoss
  stress: 100.

Attempt 3: Assigning simple MSE loss function for stress tensor
loss_coeffs:
  forces: 1.
  total_energy:
    - 1.
    - PerAtomMSELoss
  stress: 1.

Otherwise, do you recommend not adding a loss term for the stress tensor at all?
Any recommendation or guidance on using Allegro to predict stress tensors, forces, and potential energies would be welcome!

Kind regards,

Error when using ssp as the activation function

I use ssp as the activation function, but I get the following error. I found that removing @torch.jit.script from the ssp function in nequip lets it run smoothly.

Traceback (most recent call last):
  File "/Users/shipengjie/anaconda3/lib/python3.10/site-packages/nequip/utils/auto_init.py", line 232, in instantiate
    instance = builder(**positional_args, **final_optional_args)
  File "/Users/shipengjie/anaconda3/lib/python3.10/site-packages/allegro/nn/allegro.py", line 316, in __init__
    two_body_latent(
  File "/Users/shipengjie/anaconda3/lib/python3.10/site-packages/allegro/nn/_fc.py", line 153, in __init__
    features = nonlinearity(features)
RuntimeError: ShiftedSoftPlus() Expected a value of type 'Tensor (inferred)' for argument 'x' but instead found type 'Proxy'.
Inferred 'x' to be of type 'Tensor' because it was not annotated with an explicit type.
Position: 0
Value: Proxy(matmul)
Declaration: ShiftedSoftPlus(Tensor x) -> (Tensor)
Cast error details: Unable to cast Proxy(matmul) to Tensor

I am running on macOS.
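For context, the failure mode can be reproduced in isolation: a @torch.jit.script function only accepts real tensors, so code traced with torch.fx (which passes Proxy objects) fails when it calls one. A minimal standalone sketch, not nequip's actual ssp implementation:

import math
import torch
import torch.fx

@torch.jit.script
def ssp(x: torch.Tensor) -> torch.Tensor:
    # shifted softplus (illustrative definition)
    return torch.nn.functional.softplus(x) - math.log(2.0)

class Net(torch.nn.Module):
    def forward(self, x):
        # Under torch.fx tracing, `x @ x` is a Proxy, which the scripted
        # function rejects ("Unable to cast Proxy(matmul) to Tensor").
        return ssp(x @ x)

# In environments where this issue occurs, tracing raises the same RuntimeError:
torch.fx.symbolic_trace(Net())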

ase calculator

Hi,
I want to use this model to run calculations on some systems with ASE. Is there an ASE interface in the code?
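For reference, nequip ships an ASE calculator that works with deployed Allegro models; a minimal usage sketch (the model path and structure are placeholders), following the calculator usage shown in a later issue on this page:

from ase.build import bulk
from nequip.ase.nequip_calculator import NequIPCalculator

# Load a deployed (TorchScript) model as an ASE calculator.
# "deployed_model.pth" is a placeholder for your own deployed model file.
calc = NequIPCalculator.from_deployed_model("deployed_model.pth", device="cpu")

atoms = bulk("Cu", "fcc", a=3.6)  # example structure
atoms.calc = calc
print(atoms.get_potential_energy())
print(atoms.get_forces())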

How is allegro trained with forces? And are there available weights for running full scale MD simulations?

Hi developers of Allegro,

I have two questions on the methodology behind allegro and its applications:

  1. In the Nature Communications publication, the model's architecture is depicted such that the energies of the atoms are summed to get the energy of the system. I can see how the energy loss can be differentiated through the rest of the model, but I don't quite understand how you can include the force labels in the training data. A number of examples from the sGDML datasets have forces in the XYZ files, but how are these actually included in the loss function if the model predicts energy? (See the sketch below.)

  2. Also, in the more recent preprint you have shown the ability to run full-scale MD simulations on proteins. I was wondering whether weights are available for these versions of the models, so that end users can run MD simulations with them?

Thank you for the time to address my queries.
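A generic sketch of how force labels typically enter training for a model that predicts only energies (an illustration of the idea, not Allegro's actual implementation): predicted forces are the negative gradient of the predicted total energy with respect to the atomic positions, computed with automatic differentiation, so a force term can be added to the loss alongside the energy term:

import torch

def energy_force_loss(model, pos, e_ref, f_ref, force_weight=1.0):
    # `model` maps positions -> scalar total energy (hypothetical signature)
    pos = pos.detach().clone().requires_grad_(True)
    e_pred = model(pos)
    # Forces are minus the gradient of the energy w.r.t. positions;
    # create_graph=True so the force loss can itself be backpropagated.
    f_pred = -torch.autograd.grad(e_pred, pos, create_graph=True)[0]
    return (e_pred - e_ref).pow(2).mean() + force_weight * (f_pred - f_ref).pow(2).mean()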

How can I improve the training and force losses for my system, especially the Li atoms, in my LGPO system?

Dear all,
I have recently used the NequIP-Allegro framework to retrain on DFT data for LGPO systems, using 30000 configurations. I generated ML forces with the ASE calculator to compare against the DFT forces, but unfortunately the force loss did not improve at all. I have changed r_cutoff from 5 to 7 and 14; changed max_epochs from 100 to 200; changed batch_size from 1 to 4 and 6; tried different train/validation splits (80-20 and 70-30); checked whether or not to shuffle the training data; checked the mathematical expression for the overall loss and changed the force loss coefficient from 1.0 to 100; tried different seeds to get different training and validation sets; and tried l_max = 1 and 2.

I have tried everything I could think of to improve the ML forces (loss_f, loss_e and loss), especially for the Li atoms.
With a force coefficient of 100, the total loss remains near 23 and loss_f is near 0.23; with a force coefficient of 1, the total loss drops to 0.23 but loss_f is still unchanged near 0.23. These errors still seem too large.

Do you have any new ideas to help me improve forces and overall results?
Name         Epoch  wall (hours)  LR     loss_f  loss_e    loss  f_mae  f_rmse  e_mae  e/N_mae
Train        200    9.969141667   0.002  0.23    0.0368    23.0  0.213  0.485   0.867  0.0173
Validation   200    9.969141667   0.002  0.204   0.000193  20.4  0.200  0.456   0.699  0.014

"atom_types" not included in pytest and pytest not pass

Describe the bug

There is a bug in tests/conftest.py: the typemapper is not set for atomic_batch, which causes atom_types to be missing from the data.
Also, pytest throws several failures, listed below. It seems the numeric_gradient, partial_forces and normalized_basis tests do not pass.
Meanwhile, the TorchScript interpreter fails for float64.

Reproduce

Just do pytest

FAILED tests/model/test_allegro.py::TestGradient::test_numeric_gradient[float32-config0-device0] - AssertionError: assert (tensor(False, device='cuda:0') or tensor(False, device='cuda:0'))
FAILED tests/model/test_allegro.py::TestGradient::test_numeric_gradient[float32-config1-device0] - AssertionError: assert (tensor(False, device='cuda:0') or tensor(False, device='cuda:0'))
FAILED tests/model/test_allegro.py::TestGradient::test_numeric_gradient[float32-config2-device0] - AssertionError: assert (tensor(False, device='cuda:0') or tensor(False, device='cuda:0'))
FAILED tests/model/test_allegro.py::TestGradient::test_numeric_gradient[float32-config3-device0] - AssertionError: assert (tensor(False, device='cuda:0') or tensor(False, device='cuda:0'))
FAILED tests/model/test_allegro.py::TestGradient::test_partial_forces[float32-device0] - AssertionError: assert False
FAILED tests/nn/test_norm_basis.py::test_normalized_basis[float32-0.2] - assert tensor(False)
FAILED tests/nn/test_norm_basis.py::test_normalized_basis[float32-1.0] - assert tensor(False)
FAILED tests/model/test_allegro.py::TestWorkflow::test_weight_init[float64-device0-config0-model0] - RuntimeError: The following operation failed in the TorchScript interpreter.
FAILED tests/model/test_allegro.py::TestWorkflow::test_weight_init[float64-device0-config0-model1] - RuntimeError: The following operation failed in the TorchScript interpreter.
FAILED tests/model/test_allegro.py::TestWorkflow::test_jit[float64-device0-config0-model0] - RuntimeError: The following operation failed in the TorchScript interpreter.
FAILED tests/model/test_allegro.py::TestWorkflow::test_jit[float64-device0-config0-model1] - RuntimeError: The following operation failed in the TorchScript interpreter.
....

Environment (please complete the following information):

OS: centOS 7
python version: 3.9.12
python environment (commands are given for python interpreter):
nequip version: 0.5.5
e3nn version: 0.5.0
pytorch version: 1.11.0+cu113

Neighbor lists in lammps-allegro

Hello,
I have an issue compiling LAMMPS with the Allegro patch; I get the following error (with gcc):

pair_allegro.cpp:129:33: error: ‘int LAMMPS_NS::NeighRequest::ghost’ is protected within this context
129 | neighbor->requests[irequest]->ghost = 1;

I had to comment out this line in pair_allegro.cpp and hard-code ghost = 1 in neigh_request.cpp. It works, but it's probably not the best thing to do... (I think it then messes up RDF calculations...)

Thanks for your help,

Nicephore

Training dataset is wrongly thought to only have two datapoints

This is related to this discussion:

#74

As advised there, I used ase.io.read and ase.io.write to make my .extxyz training set out of ase atoms, but when I run nequip-train I get the following error:

Traceback (most recent call last):
  File ".../nequip-train", line 8, in <module>
    sys.exit(main())
  File ".../nequip/scripts/train.py", line 72, in main
    trainer = fresh_start(config)
  File ".../nequip/scripts/train.py", line 160, in fresh_start
    trainer.set_dataset(dataset, validation_dataset)
  File ".../nequip/train/trainer.py", line 1164, in set_dataset
    raise ValueError(
ValueError: too little data for training and validation. please reduce n_train and n_val

I modified nequip/train/trainer.py so that it would print out total_n which is the number of datapoints. This comes out as 2. However, my dataset has 249 datapoints in it. Here are the first 2 as an example:

6
Properties=species:S:1:charge:R:1:pos:R:3:forces:R:3 energy=3.3568491152989464E+00 pbc="F F F"
C    2.8860151849165884E-02   -1.9388596926913013E-02   -2.8759623532924683E-03   -2.6833257705618396E+00   -5.0188550432362087E+00   -7.8061520280432894E+00 
C   -2.1500824022609995E-02    3.6524386361154572E-03   -1.9698357525108050E-03    2.3421032349874338E+00   -6.6902716186377642E+00   -1.7849302559666409E+00 
C    2.3387800610515291E-02    3.8019240081639564E-03   -3.6286631993187970E-03   -3.5688662248506908E+00   -8.1820582146951519E+00    2.9663094116194868E+00 
C    1.3905598298491445E-02    7.5721377048330769E-03    8.0048531530292778E-03    9.5122339087921475E+00   -9.7041169051925280E-01    1.3042310895041316E+00 
C   -4.5729982211133953E-02    3.5546925801004338E-03    2.9333833145097753E-03   -1.2541717017776133E-01   -8.2066513548131166E+00    8.1008209468107211E+00 
C    1.0772554755713285E-03    8.0740399770008677E-04   -2.4637751624169836E-03   -4.5859471646614072E+00    3.5845739562088865E+00    8.2385401713477435E+00 
6
Properties=species:S:1:charge:R:1:pos:R:3:forces:R:3 energy=3.4898710368025143E+00 pbc="F F F"
C    7.6319141601180404E-03    8.9401622864701772E-03   -1.7921386926762573E-03    5.1772637533841459E-01   -2.7859567151648097E+00   -2.8128901836587339E-01 
C   -7.1088268866891773E-03    2.4796940863442618E-03   -2.0363664484700790E-02    8.5094886685353437E+00    3.2758187190546590E+00   -4.6838215430390271E+00 
C    1.4622996296095836E-02   -2.3619554337368280E-02   -6.2698372277412611E-03    9.1124305059183399E+00   -4.1808210366970284E+00    4.2946053591808280E+00 
C   -1.3909049200163072E-02    3.1250438952696268E-04   -2.4999557169415663E-03   -3.9259066958050122E+00   -6.5920997070775220E+00    6.8063951037962402E+00 
C   -2.4331619371849824E-02    6.8812024942755349E-03    3.4277307003635726E-03    2.0102930012740594E+00    4.1305435735101490E+00    8.8513121399612977E+00 
C    2.3094585002488198E-02    5.0059910807513461E-03    2.7497865421696303E-02    8.3670043621588555E+00    2.4338406515288975E+00    8.2732601005813198E+00 

Here is the relevant part of my config yaml:

dataset: ase
dataset_file_name: training.extxyz
ase_args:
 format: extxyz

chemical_symbols:
  - C

# training
n_train: 200
n_val: 49
batch_size: 1
max_epochs: 1
learning_rate: 0.002

What's wrong with my input? I have 249 structures but it thinks there are 2. I've checked all the docs for ase, Nequip and Allegro and it looks OK. This case has no PBC but when I run it with lattice parameters (physically valid ones) and pbc="T T T" it still gives the same error. I'm sure I've made an error somewhere but I can't find it.

Thanks in advance.
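One quick way to narrow this down is to check how many frames ASE itself parses from the file (an illustrative check, assuming the file is named training.extxyz as in the config above):

from ase.io import read

# index=":" reads every configuration in the file
frames = read("training.extxyz", index=":", format="extxyz")
print(len(frames))  # should print 249 if the file is parsed as expected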

Extrapolation task

Hello,
I have a dataset of small metal nanoparticles (each with one sorbed non-metal atom on the surface) of size 10-80 atoms, with the calculated relaxed energy as the target, i.e. Y-Me_x systems with Y = non-metal atom, Me = metal atom, x = 10-80.
I want to train an Allegro model on this data to predict the relaxed energy of larger nanoparticles (> 200 atoms), i.e. Y-Me_x with x > 200.
How could I approach this problem? Are there any important parameters/scalers in the config file connected with this task?

Activation parity

Hello - thanks for developing and maintaining these tools!

I was looking at the nequip full config and noticed the following options:

# scalar nonlinearities to use — available options are silu, ssp (shifted softplus), tanh, and abs.
# Different nonlinearities are specified for e (even) and o (odd) parity;
# note that only tanh and abs are correct for o (odd parity).
# silu typically works best for even 
nonlinearity_scalars:
  e: silu
  o: tanh

nonlinearity_gates:
  e: silu
  o: tanh

Are these options/parity differences applicable to Allegro models?

Question about the Tensor Product module.

Hello, I have been reading the code of allegro and trying to implement an E(3)-equivariant model recently. I am confused about the tensor product construction in the allegro model. For example, some modules in nn/_strided/_linear and nn/_strided/_contract seem to reimplement the tensor product and linear modules from e3nn. Still, when I compare the code with the e3nn modules, I find many differences, such as the normalization and some parameters like the sparse mode.

So my question is: what are the functional differences between the tensor product modules in allegro/nn and those in e3nn?
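For orientation, the plain e3nn building blocks that the question refers to look like this (a minimal e3nn example with arbitrary irreps, not allegro's strided implementation):

from e3nn import o3

irreps = o3.Irreps("2x0e + 2x1o")

# A standard e3nn tensor product followed by an e3nn linear layer.
tp = o3.FullyConnectedTensorProduct(irreps, irreps, irreps)
lin = o3.Linear(irreps, irreps)

x = irreps.randn(10, -1)  # batch of 10 feature vectors
y = irreps.randn(10, -1)
out = lin(tp(x, y))
print(out.shape)  # torch.Size([10, 8])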

Training on QM9

Thanks for posting the nice code! I noticed there is an experiment on the QM9 dataset in the paper, but it is not included in the code. I was wondering whether any changes are needed to adapt the code to training on molecules or crystals with different numbers of atoms.

Broken config links in README

The links to config files are broken - they have extra backticks in the URLs. For example, this:

[`configs/minimal.yaml`](`configs/minimal.yaml`)

Should become this:

[`configs/minimal.yaml`](configs/minimal.yaml)

Question about Allegro v0.3.0

Hi,
I have recently been reading the paper "Scaling the leading accuracy of deep equivariant models to biomolecular simulations of realistic size". It describes some efficient optimizations to speed up Allegro.

The paper says: "All experiments were run with the Allegro code available at https://github.com/mir-group/allegro under git commit cc76ba8 and version 0.3.0.".

However, I cannot find either this commit or v0.3.0 in this repo. Where can I find the code version that includes all the optimizations described in the paper?

Thank you!

Training with heterogeneous systems

Hello - Thanks for developing and maintaining the tools. I realize from experience this is a vague question, but are there certain architectural/hyperparameter considerations to take into account when training on datasets with many different systems (in both size and chemical composition)?

I have been having a hard time getting below 50 kcal/mol energy validation MAE (after converting both energies and forces from eV) on this dataset using a config file similar to the aspirin example in this repo (see below). I should note that I am working on the develop branch of nequip to take advantage of HDF5 datasets (though I have also used the extxyz format and see the same results).

There are many things that can go wrong in building ML potentials of course, and there is always the issue of hyperparameter search - are there certain options in this package that work better for datasets with varying system composition?

root: results/protein_frags_solv_train_24500_rcut_6
run_name: protein_frags_solv_train_24500_rcut_6

seed: 123456

dataset_seed: 123456

append: true

default_dtype: float32

model_builders:
 - allegro.model.Allegro
 # the typical model builders from `nequip` can still be used:
 - PerSpeciesRescale
 - ForceOutput
 - RescaleEnergyEtc

r_max: 6.0
avg_num_neighbors: auto
BesselBasis_trainable: true
PolynomialCutoff_p: 6
l_max: 2
parity: o3_full
num_layers: 2
env_embed_multiplicity: 128
embed_initial_edge: true
two_body_latent_mlp_latent_dimensions: [128, 256, 512, 1024]
two_body_latent_mlp_nonlinearity: silu
two_body_latent_mlp_initialization: uniform
latent_mlp_latent_dimensions: [1024, 1024, 1024]
latent_mlp_nonlinearity: silu
latent_mlp_initialization: uniform
latent_resnet: true
env_embed_mlp_latent_dimensions: []
env_embed_mlp_nonlinearity: null
env_embed_mlp_initialization: uniform

edge_eng_mlp_latent_dimensions: [128, 64]
edge_eng_mlp_nonlinearity: null
edge_eng_mlp_initialization: uniform

wandb: true
wandb_project: solvated_protein_fragments
verbose: debug

n_train: 22050
n_val: 2450

batch_size: 5
max_epochs: 1000000
learning_rate: 0.001
train_val_split: random
shuffle: true
metrics_key: validation_loss
use_ema: true
ema_decay: 0.99
ema_use_num_updates: true
loss_coeffs:
  forces: 1.
  total_energy:
    - 1.
    - PerAtomMSELoss

optimizer_name: Adam
optimizer_params:
  amsgrad: false
  betas: !!python/tuple
  - 0.9
  - 0.999
  eps: 1.0e-08
  weight_decay: 0.
lr_scheduler_name: ReduceLROnPlateau
lr_scheduler_patience: 10
lr_scheduler_factor: 0.5

early_stopping_upper_bounds:
  cumulative_wall: 604800.
early_stopping_lower_bounds:
  LR: 1.0e-5
early_stopping_patiences:
  validation_loss: 100

dataset: hdf5
dataset_file_name: path_to_my_h5
dataset_Atomicdata_options:
  r_max: 6.0

chemical_symbol_to_type:
  H: 0
  C: 1
  N: 2
  O: 3
  S: 4

GPU out of memory problem

Hi,

I am currently trying to use allegro to train on neopentyl glycol (NPG). I have 1216 atoms per frame and use the extended xyz format for my data, with the same architecture as the aspirin configs. This worked for the water case, where I have 192 atoms per frame, but when I switch to NPG I always run into a GPU out-of-memory problem unless I reduce the dimensions of the network. The GPU has 48 GB of memory, and based on previous issues 48 GB should be enough. My guess is that there is some problem with the data size. Do you think the issue is related to the number of atoms in each frame?

Thank you!

ASE calculator always fails on third calculation using same calculator instance

Describe the bug
After loading a deployed model as an ASE calculator instance, the calculator consistently gives an error on the third different structure it calculates.

To Reproduce

from nequip.ase.nequip_calculator import NequIPCalculator
from ase.build import bulk
calc = NequIPCalculator.from_deployed_model('deployed_Li_model.pth', device='cpu')       #Same error for device='cuda'
a1 = bulk('Li', 'bcc', a=3.4)
a2 = bulk('Li', 'bcc', a=3.4).repeat([2,2,2])
a3 = bulk('Li', 'bcc', a=3.5)           #The 3 structures must be different
calc.get_potential_energy(a1)     #Works fine
calc.get_potential_energy(a2)     #Works fine
calc.get_potential_energy(a3)     #Gives torchscript error below even with forces

The traceback is:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/mphuthi/.conda/envs/allegro/lib/python3.9/site-packages/ase/calculators/calculator.py", line 737, in get_property
    self.calculate(atoms, [name], system_changes)
  File "/home/mphuthi/.conda/envs/allegro/lib/python3.9/site-packages/nequip/ase/nequip_calculator.py", line 118, in calculate
    out = self.model(data)
  File "/home/mphuthi/.conda/envs/allegro/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: Unsupported value kind: Tensor

This happens with any 3 structures; I've tried multiple structures from different datasets and the error is reproduced.

Expected behavior
I expect to be able to "reuse" a calculator as many times as I want without having to create a new instance repeatedly

Environment (please complete the following information):

  • OS: centOS 7
  • python version: 3.9.12
  • python environment (commands are given for python interpreter):
    • nequip version: 0.5.5
    • e3nn version: 0.4.4
    • pytorch version: 1.10.2+cu102
  • (if relevant) GPU support with CUDA
    • cuda Version according to nvcc: 10.2
    • cuda version according to PyTorch : 10.2

Stuck in initialization

I'm trying to train a simple model with ~500 structures, but training is stuck on the 0th iteration.

Output:

Torch device: cuda
Processing dataset...
Loaded data: Batch(atomic_numbers=[52369, 1], batch=[52369], cell=[582, 3, 3], edge_cell_shift=[4453866, 3], edge_index=[2, 4453866], forces=[52369, 3], pbc=[582, 3], pos=[52369, 3], ptr=[583], total_energy=[582, 1])
    processed data size: ~120.96 MB
Cached processed data to disk
Done!
Successfully loaded the data set of type ASEDataset(582)...
Replace string dataset_forces_rms to 0.7190595865249634
Replace string dataset_per_atom_total_energy_mean to -8.752270698547363
Atomic outputs are scaled by: [O, Ti: 0.719060], shifted by [O, Ti: -8.752271].
Replace string dataset_forces_rms to 0.7190595865249634
Initially outputs are globally scaled by: 0.7190595865249634, total_energy are globally shifted by None.
Successfully built the network...
Number of weights: 10987400
Number of trainable weights: 10987400
! Starting training ...

validation
# Epoch batch         loss       loss_f       loss_e       f_rmse     e/N_rmse

The nequip-train process just stays stuck here for a long time.

error on validation lower than on training for aspirin dataset

Hello
I am using the config /example.yml to train a model on the aspirin dataset.
The energy error (MAE and RMSE) I am getting on the validation set is 5x-10x lower than the one on the training set.
The dataset is shuffled and uses a random train/validation split.
This is true throughout training (I am stopping for now at 100 epochs).

Allegro memory requirements?

I have an issue similar to this previous nequip issue (mir-group/nequip#293), but I'm only using allegro, so I'm not sure whether it's the same case. I have a 2000-configuration dataset, each configuration with ~600 atoms and PBC, in the npz format. However, preprocessing always fails with an OOM error on the CPU side, even after I increased my CPU memory to 160 GB. The network size I'm using is also much smaller than the suggested values, and I'm only using a cutoff of 4 Angstroms. What other parameters should I try to tune?

Colab example is unreachable.

Hi, I am interested in testing allegro, but it seems there is some access-permission issue such that the Colab example is not reachable. Here is a screenshot to help understand the error better:

[screenshot of the Colab access error]

Any idea how to fix it?

Best,
Zekun

Request for Access to Dataset and Training Settings Mentioned in the Article "Scaling the leading accuracy of deep equivariant models to biomolecular simulations of realistic size"

Hello,

I hope this message finds you well. I recently came across your team's article titled "Scaling the leading accuracy of deep equivariant models to biomolecular simulations of realistic size," which mentions the use of a model from your repository. I'm interested in replicating some of the experiments mentioned in the article and was wondering if it would be possible to obtain access to the dataset and training settings used for those experiments.

Access to the specific data and training configurations referenced in the article would greatly assist in my research efforts, and it would be immensely appreciated. Having access to this information would not only help in reproducing the results but also contribute to a better understanding of the model's performance in biomolecular simulations.

If there are any specific steps or procedures I need to follow to obtain this data and configuration information, please let me know. I understand the importance of proper attribution and usage of research data, and I assure you that I will adhere to any terms or conditions set forth for access.

Thank you for your time and consideration. I look forward to your response.

Unable to open colab tutorial

On opening the colab tutorial I get an error: "There was an error loading this notebook. Ensure that the file is accessible and try again."

Other notebooks work fine (for example the nequip-tutorial), so I think it's not on my end.

I did find a workaround: opening in an incognito tab (chrome). So maybe it's something on my end after all, but I thought I'd post anyway just in case.

AttributeError: 'numpy.int64' object has no attribute 'unsqueeze'

I'm trying to run a toy example of training using a toy dataset just to make sure I can get ase datasets working. I get this error when I try and run Nequip:
AttributeError: 'numpy.int64' object has no attribute 'unsqueeze'

The dataset I'm using is this (taken from the discussion #74):

3
Lattice="1.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 3.0" Properties=species:S:1:pos:R:3:forces:R:3 energy=-100 stress="0.0002 0.003 -0.004 0.003 -0.004 -0.005 -0.004 -0.005 -0.006" free_energy=-100 pbc="T T T"
H      14.80702000       9.47114000      14.83362000       0.05630700       0.18844700       0.12262800
C      14.40303000       9.58896000       4.81541000       0.28543000      -0.56192100       0.59003500
Al       0.15995000       9.61979000       9.87331000      -0.15746000       0.51516600       0.14365000
2
Lattice="1.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 3.0" Properties=species:S:1:pos:R:3:forces:R:3 energy=-100 stress="0.0002 0.003 -0.004 0.003 -0.004 -0.005 -0.004 -0.005 -0.006" free_energy=-100 pbc="T T T"
H      14.80702000       9.47114000      14.83362000       0.05630700       0.18844700       0.12262800
C      14.40303000       9.58896000       4.81541000       0.28543000      -0.56192100       0.59003500

Here is the data section of my .yaml:

dataset: ase
dataset_file_name: training.extxyz
ase_args:   
 format: extxyz

chemical_symbols:
  - H
  - C
  - Al

Here is the full error output:

Processing dataset...
Traceback (most recent call last):
  File "nequip/utils/auto_init.py", line 232, in instantiate
    instance = builder(**positional_args, **final_optional_args)
  File "nequip/data/dataset.py", line 880, in __init__
    super().__init__(
  File "nequip/data/dataset.py", line 166, in __init__
    super().__init__(root=root, type_mapper=type_mapper)
  File "nequip/data/dataset.py", line 50, in __init__
    super().__init__(root=root, transform=type_mapper)
  File "nequip/utils/torch_geometric/dataset.py", line 91, in __init__
    self._process()
  File "nequip/utils/torch_geometric/dataset.py", line 176, in _process
    self.process()
  File "nequip/data/dataset.py", line 218, in process
    data = self.get_data()
  File "nequip/data/dataset.py", line 971, in get_data
    datas = reader(rank=0)
  File "nequip/data/dataset.py", line 789, in _ase_dataset_reader
    AtomicData.from_ase(atoms=atoms, **atomicdata_kwargs)
  File "nequip/data/AtomicData.py", line 443, in from_ase
    return cls.from_points(
  File "nequip/data/AtomicData.py", line 330, in from_points
    return cls(edge_index=edge_index, pos=torch.as_tensor(pos), **kwargs)
  File "nequip/data/AtomicData.py", line 225, in __init__
    _process_dict(kwargs)
  File "nequip/data/AtomicData.py", line 155, in _process_dict
    kwargs[k] = v.unsqueeze(-1)
AttributeError: 'numpy.int64' object has no attribute 'unsqueeze'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "nequip-train", line 8, in <module>
    sys.exit(main())
  File "nequip/scripts/train.py", line 72, in main
    trainer = fresh_start(config)
  File "nequip/scripts/train.py", line 148, in fresh_start
    dataset = dataset_from_config(config, prefix="dataset")
  File "nequip/data/_build.py", line 78, in dataset_from_config
    instance, _ = instantiate(
  File "nequip/utils/auto_init.py", line 234, in instantiate
    raise RuntimeError(
RuntimeError: Failed to build object with prefix `dataset` using builder `ASEDataset`

I'm running it on a mac. Here are details of the environment:

# Name                    Version                   Build  Channel
appdirs                   1.4.4                    pypi_0    pypi
ase                       3.22.1                   pypi_0    pypi
ca-certificates           2023.12.12           hca03da5_0  
certifi                   2024.2.2                 pypi_0    pypi
cffi                      1.16.0           py39h80987f9_0  
charset-normalizer        3.3.2                    pypi_0    pypi
click                     8.1.7                    pypi_0    pypi
contourpy                 1.2.0                    pypi_0    pypi
cycler                    0.12.1                   pypi_0    pypi
docker-pycreds            0.4.0                    pypi_0    pypi
e3nn                      0.5.1                    pypi_0    pypi
fonttools                 4.49.0                   pypi_0    pypi
future                    0.18.3           py39hca03da5_0  
gitdb                     4.0.11                   pypi_0    pypi
gitpython                 3.1.42                   pypi_0    pypi
idna                      3.6                      pypi_0    pypi
importlib-resources       6.1.2                    pypi_0    pypi
kiwisolver                1.4.5                    pypi_0    pypi
libblas                   3.9.0           16_osxarm64_openblas    conda-forge
libcblas                  3.9.0           16_osxarm64_openblas    conda-forge
libcxx                    14.0.6               h848a8c0_0  
libffi                    3.4.4                hca03da5_0  
libgfortran               5.0.0           11_3_0_hca03da5_28  
libgfortran5              11.3.0              h009349e_28  
liblapack                 3.9.0           16_osxarm64_openblas    conda-forge
libopenblas               0.3.21               h269037a_0  
llvm-openmp               14.0.6               hc6e5704_0  
matplotlib                3.8.3                    pypi_0    pypi
mir-allegro               0.2.0                    pypi_0    pypi
mpmath                    1.3.0                    pypi_0    pypi
ncurses                   6.4                  h313beb8_0  
nequip                    0.5.6                    pypi_0    pypi
ninja                     1.10.2               hca03da5_5  
ninja-base                1.10.2               h525c30c_5  
npzviewer                 0.2.0                    pypi_0    pypi
numpy                     1.26.4                   pypi_0    pypi
openssl                   3.0.13               h1a28f6b_0  
opt-einsum                3.3.0                    pypi_0    pypi
opt-einsum-fx             0.1.4                    pypi_0    pypi
packaging                 23.2                     pypi_0    pypi
pillow                    10.2.0                   pypi_0    pypi
pip                       23.3.1           py39hca03da5_0  
protobuf                  4.25.3                   pypi_0    pypi
psutil                    5.9.8                    pypi_0    pypi
pycparser                 2.21               pyhd3eb1b0_0  
pyparsing                 3.1.1                    pypi_0    pypi
pyqt5                     5.15.10                  pypi_0    pypi
pyqt5-qt5                 5.15.12                  pypi_0    pypi
pyqt5-sip                 12.13.0                  pypi_0    pypi
python                    3.9.18               hb885b13_0  
python-dateutil           2.8.2                    pypi_0    pypi
pytorch                   1.10.2          cpu_py39h23cb94c_0  
pyyaml                    6.0.1                    pypi_0    pypi
readline                  8.2                  h1a28f6b_0  
requests                  2.31.0                   pypi_0    pypi
scipy                     1.12.0                   pypi_0    pypi
sentry-sdk                1.40.6                   pypi_0    pypi
setproctitle              1.3.3                    pypi_0    pypi
setuptools                68.2.2           py39hca03da5_0  
six                       1.16.0                   pypi_0    pypi
smmap                     5.0.1                    pypi_0    pypi
sqlite                    3.41.2               h80987f9_0  
sympy                     1.12                     pypi_0    pypi
tk                        8.6.12               hb8d0fd4_0  
torch-ema                 0.3                      pypi_0    pypi
torch-runstats            0.2.0                    pypi_0    pypi
tqdm                      4.66.2                   pypi_0    pypi
typing-extensions         4.9.0            py39hca03da5_1  
typing_extensions         4.9.0            py39hca03da5_1  
tzdata                    2024a                h04d1e81_0  
urllib3                   2.2.1                    pypi_0    pypi
wandb                     0.16.3                   pypi_0    pypi
wheel                     0.41.2           py39hca03da5_0  
xz                        5.4.5                h80987f9_0  
zipp                      3.17.0                   pypi_0    pypi
zlib                      1.2.13               h5a0b063_0  

Thanks

parity missing in minimal config

Running `nequip-train configs/minimal.yaml` fails on a key error because `parity` is missing. Adding `parity: o3_full` to `configs/minimal.yaml` fixes it.

Very Slow Update After the 2nd Batch 1st Epoch

Hello MIR group,

I'm using Allegro as well as NequIP and FLARE to build MLIPs for modeling condensed phase systems and systems for heterogeneous catalysis, and I'm having a little bit of difficulty with Allegro. On my laptop, I can build smaller Allegro models and training goes as expected. However, for larger models that I am training on Perlmutter, after training on the second batch of the first epoch, it takes a while for the 3rd batch to process, and I get the message copied below. After this message gets displayed, training continues as expected. Have you seen this issue before, and if so is there a way to fix this and make training not take so long in the beginning? I've copied the error message, my allegro config file, and my SLURM script on Perlmutter below. The SLURM script and config file are for a hyperparameter scan, and for the hyperparameters I have looked at so far they all have this issue. Any help would be much appreciated. Thanks!

Sincerely,
Woody

Message that appears in training:

# Epoch batch         loss       loss_f  loss_stress       loss_e        f_mae       f_rmse     Ar_f_mae  psavg_f_mae    Ar_f_rmse psavg_f_rmse        e_mae      e/N_mae   stress_mae  stress_rmse
      0     1        0.951        0.949     1.31e-05      0.00122        0.106        0.203        0.106        0.106        0.203        0.203         1.39      0.00546     0.000341     0.000754
      0     2          0.9        0.899     4.69e-06     0.000544        0.101        0.197        0.101        0.101        0.197        0.197        0.414      0.00385     0.000281     0.000451
/global/homes/w/wnw36/.conda/envs/nequip/lib/python3.10/site-packages/torch/autograd/__init__.py:276: UserWarning: operator() profile_node %884 : int[] = prim::profile_ivalue(%882)
 does not have profile information (Triggered internally at  /opt/conda/conda-bld/pytorch_1659484808560/work/torch/csrc/jit/codegen/cuda/graph_fuser.cpp:104.)
  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
      0     3         1.21         1.21     1.24e-05     0.000595        0.114        0.229        0.114        0.114        0.229        0.229        0.652      0.00382     0.000324     0.000732

SLURM script on Perlmutter:

#!/bin/bash
#SBATCH --job-name=nequip
#SBATCH --output=nequip.o%j
#SBATCH --error=nequip.e%j
#SBATCH --nodes=1
#SBATCH --time=24:00:00
#SBATCH --constraint=gpu
#SBATCH --qos=regular
#SBATCH --exclusive

module load python
conda activate nequip

mkdir -p outputs
for rcut in 4.0 6.0; do
	for learning in 0.001 0.005; do
		for lmax in 4 5; do 
			for nfeatures in 32 64; do
				for nlayers in 4; do
					file=gridsearch-$rcut-$learning-$lmax-$nfeatures-$nlayers.yaml
					sed -e "s/rcrcrc/$rcut/g" -e "s/lmaxlmaxlmax/$lmax/g" -e "s/lratelratelrate/$learning/g" -e "s/nfeatnfeatnfeat/$nfeatures/g" -e "s/nlayernlayernlayer/$nlayers/g" template.yaml > $file
					nequip-train $file > outputs/$rcut-$learning-$lmax-$nfeatures-$nlayers.log
				done
			done
		done
	done
done

Allegro config file:

run_name: 4.0-0.001-4-32-4-4

seed: 123456

dataset_seed: 123456

append: true

default_dtype: float32

model_builders:
 - allegro.model.Allegro
 - PerSpeciesRescale
 - StressForceOutput
 - RescaleEnergyEtc

r_max: 4.0
 
avg_num_neighbors: auto

BesselBasis_trainable: true

PolynomialCutoff_p: 6  

l_max: 4

parity: o3_full  

num_layers: 4

env_embed_multiplicity: 32

embed_initial_edge: true

two_body_latent_mlp_latent_dimensions: [128, 256, 512, 1024]
two_body_latent_mlp_nonlinearity: silu
two_body_latent_mlp_initialization: uniform

latent_mlp_latent_dimensions: [1024, 1024, 1024]

latent_mlp_nonlinearity: silu
latent_mlp_initialization: uniform
latent_resnet: true

env_embed_mlp_latent_dimensions: []

env_embed_mlp_nonlinearity: null

env_embed_mlp_initialization: uniform

edge_eng_mlp_latent_dimensions: [128]

edge_eng_mlp_nonlinearity: null

edge_eng_mlp_initialization: uniform


dataset: ase
dataset_file_name: ../trajectory/traj.xyz
ase_args:
  format: extxyz
chemical_symbol_to_type:
  Ar: 0
wandb: false
wandb_project: Ar
verbose: debug
n_train: 1300

n_val: 100

batch_size: 5
validation_batch_size: 10

max_epochs: 10
learning_rate: 0.001

train_val_split: random

shuffle: true

metrics_key: validation_loss
use_ema: true
ema_decay: 0.99

ema_use_num_updates: true

loss_coeffs:
  forces: 1.
  stress: 1.
  total_energy:          
    - 1.
    - PerAtomMSELoss

optimizer_name: Adam
optimizer_params:
  amsgrad: false
  betas: !!python/tuple
  - 0.9
  - 0.999
  eps: 1.0e-08
  weight_decay: 0.

lr_scheduler_name: ReduceLROnPlateau
lr_scheduler_patience: 50
lr_scheduler_factor: 0.5
early_stopping_upper_bounds:
  cumulative_wall: 604800.

early_stopping_lower_bounds:
  LR: 1.0e-5

early_stopping_patiences:
  validation_loss: 100

metrics_components:
  - - forces                              
    - mae                                
  - - forces
    - rmse
  - - forces
    - mae
    - PerSpecies: True                    
      report_per_component: False          
  - - forces
    - rmse
    - PerSpecies: True
      report_per_component: False
  - - total_energy
    - mae
  - - total_energy
    - mae
    - PerAtom: True                  
  - - stress
    - mae
  - - stress
    - rmse

Best way to integrate virial and stress calculation with allegro?

There have been multiple attempts aimed at integrating virials and stress into nequip and allegro: a nequip pull request, the pair-allegro stress branch, etc. However, I haven't found a comprehensive tutorial on integrating virial and stress calculation with the allegro model yet. I understand that the stress feature is under development, but if I would like to test the existing features, which branches should I use for nequip, allegro and pair-allegro? Thanks in advance!
