mir-group / nequip

NequIP is a code for building E(3)-equivariant interatomic potentials

Home Page: https://www.nature.com/articles/s41467-022-29939-5

License: MIT License

Python 100.00%

machine-learning atomistic-simulations molecular-dynamics computational-chemistry deep-learning interatomic-potentials force-fields pytorch drug-discovery materials-science

nequip's Issues

🐛 [BUG] lammps/build/lmp: No such file (tutorial)

Describe the bug
In the Colab tutorial example, running LAMMPS fails because "../lammps/build/lmp" cannot be found:

!cd lammps_run/ && ../lammps/build/lmp -in toluene_minimize.in
>>>/bin/bash: ../lammps/build/lmp: No such file or directory

To Reproduce
Run the colab notebook with max_epochs 200

Expected behavior
LAMMPS runs
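
Before launching LAMMPS, it can help to confirm that the binary actually exists at the expected relative path. The following is a minimal sketch, assuming the notebook layout implied by the failing cell (`lammps/build/lmp` next to `lammps_run/`); the helper name `find_executable` is illustrative and not part of the tutorial.

```python
import os
from pathlib import Path
from typing import Iterable, Optional


def find_executable(candidates: Iterable[str]) -> Optional[Path]:
    """Return the first candidate that is an existing, executable file."""
    for candidate in candidates:
        path = Path(candidate)
        if path.is_file() and os.access(path, os.X_OK):
            return path.resolve()
    return None


if __name__ == "__main__":
    # Path taken from the failing notebook cell, relative to lammps_run/.
    lmp = find_executable(["../lammps/build/lmp"])
    if lmp is None:
        print("LAMMPS binary not found; the build step may have failed or been skipped.")
    else:
        print(f"Found LAMMPS at {lmp}")
```

If the check reports a missing binary, the output of the build cell is the next place to look.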

🐛 [BUG] Does not work with RTX 4080 GPU

Describe the bug
Does not work with RTX 4080 GPU

Expected behavior
Package works with latest NVIDIA GPUs, including RTX 4080

Additional context
The PyTorch and CUDA versions supported by the package are not compatible with the latest NVIDIA GPUs. Is it possible (or planned) to make NequIP compatible with the latest PyTorch and CUDA?

❓ [QUESTION] Dataset preprocessing computation times

Hi,

I'm experiencing some weird behaviour with the dataset preprocessing step. Depending on specific combinations of lattice vectors and pbcs, this preprocessing can take an unexpected amount of time. To frame the following, I'm extracting clusters of atoms from periodic UiO-66.

Three different (very small) datasets in the added zip-file nicely illustrate this:

'clusters_a.xyz' contains 10 clusters extracted from a 2x2x2 supercell. These still have the lattice vectors from the periodic system and enabled pbcs. The cell is much larger than a reasonable interaction radius, so there are no artificial periodic interactions.
'clusters_b.xyz' contains the same set of clusters, with the original periodic lattice vectors and disabled pbcs.
'clusters_c.xyz' once again contains the same clusters, this time without lattice vectors altogether.

A very rough timing job revealed that the first and third dataset preprocess in a comparable timespan (about half a second on my machine), whereas the second dataset took almost 2 orders of magnitude longer.

Any ideas as to where this discrepancy comes from? It can easily be avoided by a small change to the dataset, but it seems like it should not occur in the first place.

FYI, I recently updated to v0.5.0, but initially encountered this issue in v0.3.3.

datasets.zip
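
The "small change to the dataset" mentioned above amounts to making clusters_b.xyz look like clusters_c.xyz, i.e. dropping the lattice from frames that are not periodic. The following is a minimal sketch, assuming the extended-XYZ comment line carries `Lattice="..."` and `pbc="..."` entries in the usual extxyz convention. Both function names are hypothetical. The sketch only works around the slow case and does not explain why preprocessing is slow.

```python
import re

# Matches a Lattice="..." entry (plus any whitespace in front of it).
_LATTICE_RE = re.compile(r'\s*\bLattice="[^"]*"')
_PBC_RE = re.compile(r'\bpbc="([^"]*)"')


def strip_lattice_if_nonperiodic(comment: str) -> str:
    """Drop the Lattice entry from a comment line whose pbc flags are all false."""
    pbc = _PBC_RE.search(comment)
    if pbc is None:
        return comment  # no pbc information: leave the line alone
    if any(flag.upper().startswith("T") for flag in pbc.group(1).split()):
        return comment  # periodic along at least one axis: keep the cell
    return _LATTICE_RE.sub("", comment).strip()


def rewrite_frames(text: str) -> str:
    """Apply strip_lattice_if_nonperiodic to every frame of an extxyz file."""
    lines = text.splitlines()
    out, i = [], 0
    while i < len(lines):
        if not lines[i].strip():  # tolerate blank lines between frames
            i += 1
            continue
        n_atoms = int(lines[i])
        out.append(lines[i])
        out.append(strip_lattice_if_nonperiodic(lines[i + 1]))
        out.extend(lines[i + 2 : i + 2 + n_atoms])
        i += 2 + n_atoms
    return "\n".join(out) + "\n"
```

Running `rewrite_frames` over the contents of clusters_b.xyz would produce a file in the style of clusters_c.xyz, which the timings above suggest preprocesses quickly.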

🐛 [BUG] Problem with nequip_calculator in ASE

When I try to run an NVT MD simulation with a deployed nequip model using ASE, I get the following error:

/kyukon/scratch/gent/vo/000/gvo00003/vsc42365/ForInstall/YAFF_ASE_CP2K_NEQUIP/accelgor/yacn_acc/lib/python3.8/site-packages/nequip/scripts/deploy.py:115: UserWarning: Loaded model had a different value for _jit_bailout_depth than was currently set; changing the GLOBAL setting!
  warnings.warn(
Traceback (most recent call last):
  File "yaff_neq_MD.py", line 193, in <module>
    simulate(steps, step, start, atoms, calc_neq, temperature, pressure)
  File "yaff_neq_MD.py", line 144, in simulate
    verlet.run(steps)
  File "/kyukon/scratch/gent/vo/000/gvo00003/vsc42365/ForInstall/YAFF_ASE_CP2K_NEQUIP/accelgor/yacn_acc/lib/python3.8/site-packages/yaff/sampling/iterative.py", line 128, in run
    if self.propagate():
  File "/kyukon/scratch/gent/vo/000/gvo00003/vsc42365/ForInstall/YAFF_ASE_CP2K_NEQUIP/accelgor/yacn_acc/lib/python3.8/site-packages/yaff/sampling/verlet.py", line 351, in propagate
    self.epot = self.ff.compute(self.gpos, self.vtens)
  File "/kyukon/scratch/gent/vo/000/gvo00003/vsc42365/ForInstall/YAFF_ASE_CP2K_NEQUIP/accelgor/yacn_acc/lib/python3.8/site-packages/yaff/pes/ff.py", line 157, in compute
    self.energy = self._internal_compute(my_gpos, my_vtens)
  File "/kyukon/scratch/gent/vo/000/gvo00003/vsc42365/ForInstall/YAFF_ASE_CP2K_NEQUIP/accelgor/yacn_acc/lib/python3.8/site-packages/yaff/pes/ff.py", line 272, in _internal_compute
    result = sum([part.compute(gpos, vtens) for part in self.parts])
  File "/kyukon/scratch/gent/vo/000/gvo00003/vsc42365/ForInstall/YAFF_ASE_CP2K_NEQUIP/accelgor/yacn_acc/lib/python3.8/site-packages/yaff/pes/ff.py", line 272, in <listcomp>
    result = sum([part.compute(gpos, vtens) for part in self.parts])
  File "/kyukon/scratch/gent/vo/000/gvo00003/vsc42365/ForInstall/YAFF_ASE_CP2K_NEQUIP/accelgor/yacn_acc/lib/python3.8/site-packages/yaff/pes/ff.py", line 157, in compute
    self.energy = self._internal_compute(my_gpos, my_vtens)
  File "yaff_neq_MD.py", line 68, in _internal_compute
    energy = self.atoms.get_potential_energy() * molmod.units.electronvolt
  File "/kyukon/scratch/gent/vo/000/gvo00003/vsc42365/ForInstall/YAFF_ASE_CP2K_NEQUIP/accelgor/yacn_acc/lib/python3.8/site-packages/ase/atoms.py", line 731, in get_potential_energy
    energy = self._calc.get_potential_energy(self)
  File "/kyukon/scratch/gent/vo/000/gvo00003/vsc42365/ForInstall/YAFF_ASE_CP2K_NEQUIP/accelgor/yacn_acc/lib/python3.8/site-packages/ase/calculators/abc.py", line 24, in get_potential_energy
    return self.get_property(name, atoms)
  File "/kyukon/scratch/gent/vo/000/gvo00003/vsc42365/ForInstall/YAFF_ASE_CP2K_NEQUIP/accelgor/yacn_acc/lib/python3.8/site-packages/ase/calculators/calculator.py", line 499, in get_property
    self.calculate(atoms, [name], system_changes)
  File "/kyukon/scratch/gent/vo/000/gvo00003/vsc42365/ForInstall/YAFF_ASE_CP2K_NEQUIP/accelgor/yacn_acc/lib/python3.8/site-packages/nequip/ase/nequip_calculator.py", line 111, in calculate
    out = self.model(AtomicData.to_AtomicDataDict(data))
  File "/kyukon/scratch/gent/vo/000/gvo00003/vsc42365/ForInstall/YAFF_ASE_CP2K_NEQUIP/accelgor/yacn_acc/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: default_program(18): error: expected a ")"

default_program(21): error: expected a ";"

default_program(23): error: expression must have class type

default_program(24): error: expression must have class type

default_program(25): error: expression must have class type

default_program(26): error: expression must have class type

default_program(28): error: expression must have class type

7 errors detected in the compilation of "default_program".

nvrtc compilation failed: 

#define NAN __int_as_float(0x7fffffff)
#define POS_INFINITY __int_as_float(0x7f800000)
#define NEG_INFINITY __int_as_float(0xff800000)


template<typename T>
__device__ T maximum(T a, T b) {
  return isnan(a) ? a : (a > b ? a : b);
}

template<typename T>
__device__ T minimum(T a, T b) {
  return isnan(a) ? a : (a < b ? a : b);
}

extern "C" __global__
void fused_mul_div_sin_div_mul_mul(float* t_, float* t__, float* aten_mul, float* aten_mul_1, float* aten_sin, float* aten_div, float* aten_mul_2, float* const_self.model.func.radial_basis.basis.bessel_weights) {
{
  if (512 * blockIdx.x + threadIdx.x<241728 ? 1 : 0) {
    float const_self.model.func.radial_basis.basis.bessel_weights_1 = __ldg(const_self.model.func.radial_basis.basis.bessel_weights + (512 * blockIdx.x + threadIdx.x) % 8);
    float t___1 = __ldg(t__ + (512 * blockIdx.x + threadIdx.x) / 8);
    aten_mul_2[512 * blockIdx.x + threadIdx.x] = const_self.model.func.radial_basis.basis.bessel_weights_1 * t___1;
    aten_div[512 * blockIdx.x + threadIdx.x] = (float)((double)(const_self.model.func.radial_basis.basis.bessel_weights_1 * t___1) / 6.0);
    aten_sin[512 * blockIdx.x + threadIdx.x] = sinf((float)((double)(const_self.model.func.radial_basis.basis.bessel_weights_1 * t___1) / 6.0));
    aten_mul_1[512 * blockIdx.x + threadIdx.x] = (float)((double)((sinf((float)((double)(const_self.model.func.radial_basis.basis.bessel_weights_1 * t___1) / 6.0))) / t___1) * 0.3333333333333333);
    float v = __ldg(t_ + (512 * blockIdx.x + threadIdx.x) / 8);
    aten_mul[512 * blockIdx.x + threadIdx.x] = (float)((double)((sinf((float)((double)(const_self.model.func.radial_basis.basis.bessel_weights_1 * t___1) / 6.0))) / t___1) * 0.3333333333333333) * v;
  }
}
}

I use the following code to create the nequip calculator:

from nequip.ase.nequip_calculator import NequIPCalculator

calc_neq = NequIPCalculator.from_deployed_model(
    model_path=path_model,
    species_to_type_name={"C": "C", "H": "H", "N": "N", "Pb": "Pb", "I": "I"},
    device="cuda",
)

I have installed nequip and other libraries with the following commands (I was using remote computing infrastructure for which Python 3.8.6 and CUDA 11.1.1 were already installed):

pip install numpy==1.19.5
pip install git+https://gitlab.com/ase/ase.git
pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html
pip install torch-scatter torch-sparse torch-cluster torch-spline-conv torch-geometric==1.7.2 -f https://data.pyg.org/whl/torch-1.9.0+cu111.html
pip install "git+https://github.com/Linux-cpp-lisp/pytorch_ema@context_manager#egg=torch_ema"
pip install git+https://github.com/mir-group/nequip 
pip install wandb

With this installation, I was able to successfully train a nequip model, the info of the trained model can be found in the following file:
deployed_model_info.txt

As I have no experience with Torch, I do not understand the underlying problem. For example, is this an installation problem, is there something wrong with the nequip model, or am I defining the nequip calculator incorrectly? Could you propose possible solutions or give any insight into this issue?

Kind regards
Tom
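
One detail stands out in the generated kernel quoted above: the last parameter, `const_self.model.func.radial_basis.basis.bessel_weights`, contains dots, which are not legal in a C/CUDA identifier. Counting from the blank line after "nvrtc compilation failed:", the errors at lines 18 and 21 fall on the two lines that declare that name. This would point at the code generated by the TorchScript GPU fuser and not at the calculator setup, although it is an inference from the log and not a confirmed diagnosis. The sketch below flags such names in a kernel signature; the helper is illustrative only.

```python
import re
from typing import List

C_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def invalid_parameter_names(signature: str) -> List[str]:
    """Return parameter names in a C-style signature that are not valid identifiers."""
    inside = signature[signature.index("(") + 1 : signature.rindex(")")]
    names = []
    for param in inside.split(","):
        if not param.strip():
            continue  # empty parameter list
        # The name is the last whitespace-separated token, minus pointer stars.
        name = param.strip().split()[-1].lstrip("*")
        if not C_IDENTIFIER.match(name):
            names.append(name)
    return names


# Signature copied from the nvrtc dump in the traceback.
signature = (
    "void fused_mul_div_sin_div_mul_mul(float* t_, float* t__, float* aten_mul, "
    "float* aten_mul_1, float* aten_sin, float* aten_div, float* aten_mul_2, "
    "float* const_self.model.func.radial_basis.basis.bessel_weights)"
)
print(invalid_parameter_names(signature))
```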

❓ [QUESTION] Training a model with xyz dataset

Hi,
we are trying to train a model on a dataset in extxyz format that contains cells of bcc iron.
Here is a sample of one cell:

54
Lattice="8.5008 0.0 0.0 0.0 8.5008 0.0 0.0 0.0 8.5008" Properties=species:S:1:pos:R:3:forces:R:3:Z:I:1 config_type=phonons_54_high config_name=bcc_bulk_54_0000 ecutwfc=1224.51225528 pbc="T T T" kpoints="4 4 4" degauss=0.136056917253 energy=-186887.234986
Fe     -0.00607641       0.00230075      -0.02907879      -0.73492052      -0.02561024       0.61799986       26 
Fe      1.32012806       1.31778495       1.44496509       0.44297176       1.18663711      -0.28337233       26 
Fe     -0.01845451      -0.12003023       2.89408200       0.78139089       0.95602536      -0.81690734       26 
...
(each row contains position and forces per atom; only the first 3 out of 54 atoms are shown)

Because the dataset contains cells of different sizes (mainly 1-, 54- and 128-atom cells), the total energies differ widely.
At the end of training we obtain a validation MAE on e/N on the order of hundreds of meV, and the model performs badly when used to predict atomic properties.
The problem vanishes if we limit training to cells of the same size. Is it related to the different cell sizes? Is it perhaps necessary to standardize the energies beforehand?
Attached is our configuration file, test.txt. It is a variant of minimal_eng.yaml, as we are training only on energies for now.
What is the correct setup of the configuration file in this case?
test.txt
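
Total energies grow roughly linearly with the number of atoms, so a dataset mixing 1-, 54- and 128-atom cells spans very different energy scales, while the energy per atom stays comparable. The arithmetic below illustrates this with the energy from the sample frame above. The loop at the end uses hypothetical cells built from the same per-atom value, not data from the issue. Whether NequIP needs the energies standardized by hand depends on the version and configuration. Another config on this page sets `per_species_rescale_shifts: dataset_per_atom_total_energy_mean`, which suggests that the package can apply a per-atom shift itself.

```python
# Energy per atom for the 54-atom sample frame quoted above.
n_atoms = 54
total_energy = -186887.234986  # from the extxyz comment line; units as in the dataset

energy_per_atom = total_energy / n_atoms
print(f"E/N = {energy_per_atom:.6f}")

# Hypothetical cells of other sizes with the same per-atom energy, to show how
# the total energy scales with N while E/N stays fixed.
for n in (1, 54, 128):
    print(f"N = {n:4d}  E_total = {energy_per_atom * n:16.6f}  E/N = {energy_per_atom:.6f}")
```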

🌟 [FEATURE] Multi-system dataset support & atomic charges/dipoles prediction

Is your feature request related to a problem? Please describe.
The trained model shows great performance on datasets with a single system, like MD17. I'm wondering whether it also supports training and fitting on datasets with multiple systems.

I removed npz_fixed_field_keys from the config file and ran a quick test on the sn2-reaction dataset used in PhysNet, which includes various structures of the different molecules involved in those reactions. It uses zero padding to represent molecules with fewer atoms. Here is the result.

Training
# Epoch batch         loss       loss_f       loss_e     0_f_rmse     1_f_rmse     2_f_rmse     3_f_rmse     4_f_rmse     5_f_rmse     6_f_rmse   all_f_rmse      0_f_mae      1_f_mae      2_f_mae      3_f_mae      4_f_mae      5_f_mae      6_f_mae    all_f_mae        e_mae
    1     1          nan          nan          nan            0            0            0         1.42            0            0            0        0.203            0            0            0        0.711            0            0            0        0.102          nan
    1     2          nan          nan          nan            0            0            0            0            0            0            0            0            0            0            0            0            0            0            0            0          nan
    1     3          nan          nan          nan            0            0            0            0            0            0            0            0            0            0            0            0            0            0            0            0          nan
    1     4          nan          nan          nan            0            0            0            0            0            0            0            0            0            0            0            0            0            0            0            0          nan
    1     5          nan          nan          nan            0            0            0            0            0            0            0            0            0            0            0            0            0            0            0            0          nan
    1     6          nan          nan          nan            0            0            0            0            0            0            0            0            0            0            0            0            0            0            0            0          nan
    1     7          nan          nan          nan            0            0            0            0            0            0            0            0            0            0            0            0            0            0            0            0          nan
    1     8          nan          nan          nan            0            0            0            0            0            0            0            0            0            0            0            0            0            0            0            0          nan

Also, in order to model long-range interactions, properties like atomic charges and dipoles are needed. I wonder whether it is possible to implement this feature on top of the existing code.
I've tried to implement it myself, but it is not clear to me exactly which part of the code should be modified. :(

Describe the solution you'd like

Multi-system dataset support
Customized loss function, including energy, atomic forces/charges and dipole.

Additional context
atomic_number in sn2 dataset

>>> print(x['Z'][:10])
[[ 6  1  1  1  9 53]
 [53 53  0  0  0  0]
 [ 6  1  1  1 35  9]
 [ 6  1  1  1  9 53]
 [ 9  9  0  0  0  0]
 [ 6  1  1  1 17 53]
 [ 6  1  1  1 17 17]
 [ 6  1  1  1 35 17]
 [ 6  1  1  1 17 17]
 [ 6  1  1  1  9 53]]
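
The NaN losses may come from the padded entries being treated as real atoms, though the log alone does not prove this. The following is a minimal sketch of removing the padding per structure before handing the data to a training pipeline, assuming that atomic number 0 is used purely as padding (as described above). The helper names are illustrative and not part of NequIP.

```python
from typing import List

# First rows of x['Z'] as printed above; 0 marks a padded (absent) atom.
Z_PADDED = [
    [6, 1, 1, 1, 9, 53],
    [53, 53, 0, 0, 0, 0],
    [6, 1, 1, 1, 35, 9],
    [6, 1, 1, 1, 9, 53],
    [9, 9, 0, 0, 0, 0],
]


def strip_padding(row: List[int]) -> List[int]:
    """Keep only real atoms, assuming atomic number 0 is used purely as padding."""
    return [z for z in row if z != 0]


def real_atom_mask(row: List[int]) -> List[bool]:
    """Boolean mask that can be applied to the matching position and force rows."""
    return [z != 0 for z in row]


for row in Z_PADDED:
    print(strip_padding(row), sum(real_atom_mask(row)))
```

The same mask would have to be applied to the positions and forces of each structure, which then have different lengths, so they can no longer be stored as one rectangular array.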

Nequip memory requirements ❓ [QUESTION]

Is there a rule of thumb for NequIP memory requirements? I have a dataset of 7000 configurations (Si, 64 atoms each, periodic) in npz format. I am trying to train on 3000 configurations, but the scheduler keeps killing my jobs with OOM errors.

How much RAM are we supposed to provide?
How can I reduce the RAM requirement?
How do I train on energies only? (I tried removing forces: 1 from the loss_coeffs section, but the log still says that the forces RMS was computed for scaling. Is this expected?)

I am using the model in example.yaml, but with 3 layers instead of 4.
My last attempt used 1 core, 1 A100 GPU and 100 GB of RAM.

❓ [QUESTION] Training a model to perform Molecular Dynamics simulations on Copper Formate System

Dear nequip developers,

We recently started training a model on a copper formate system. In essence, it was similar to the system you described in the nequip paper (48 Copper atoms and 1 formate molecule), however we used a slightly bigger system comprising 144 copper atoms and 1 formate molecule.

We generated training data following an approach similar to the one you described: we performed nudged elastic band simulations with 20 images, from which we chose 14 images as starting points for AIMD runs of 500 steps of 0.5 fs at 300 K. We then trained a model on this data, which reached a low prediction error on both energies and forces on the test set. However, in MD simulations (with LAMMPS) the model quickly broke down: the formate molecule entered the copper surface and the surface itself disintegrated.

We figured we needed more frames at higher temperatures to inform the model better, so we performed additional AIMD simulations at 500, 2000 and 4000 K. Apart from the temperature, they were identical to the previous AIMD calculation. In total, a dataset of 28000 structures was obtained.

These 28k structures were split into 80% training and 20% test sets. The training set was then further divided into 80% for training and 20% for validation. We chose such a large training set because smaller ones did not yield good MD trajectories.
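
The dataset size and the nested 80/20 split can be reproduced from the numbers in the text; the variable names are only for illustration.

```python
# Counts stated in the text.
neb_images_used = 14
aimd_steps = 500
temperatures_K = (300, 500, 2000, 4000)

n_structures = neb_images_used * aimd_steps * len(temperatures_K)

# 80/20 split into (train + validation) and test, then 80/20 again.
n_trainval = n_structures * 80 // 100
n_test = n_structures - n_trainval
n_train = n_trainval * 80 // 100
n_val = n_trainval - n_train

print(n_structures, n_trainval, n_test, n_train, n_val)
# 28000 22400 5600 17920 4480
```

The config quoted in this post sets n_train: 18000 and n_val: 4400, which add up to the same 22400 structures but correspond to a split of roughly 80.4% / 19.6% instead of exactly 80% / 20%.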

We used the following configuration of the network (example of yaml file):

root: results/name
run_name: name
seed: 123
dataset_seed: 456                                                                  
append: true                                                             
default_dtype: float32                                                
allow_tf32: false                                                           

# network
r_max: 5.0                                          
num_layers: 5                        

chemical_embedding_irreps_out: 32x0e                                         
feature_irreps_hidden: 32x0o + 32x0e + 32x1o + 32x1e + 32x2o + 32x2e               
irreps_edge_sh: 0e + 1o                                                          
conv_to_output_hidden_irreps_out: 16x0e              

nonlinearity_type: gate                                                  
resnet: false                                              

nonlinearity_scalars:
  e: silu
  o: tanh

nonlinearity_gates:
  e: silu
  o: tanh

# radial network basis
num_basis: 8                                  
BesselBasis_trainable: true          
PolynomialCutoff_p: 6       

# radial network
invariant_layers: 2                                         
invariant_neurons: 64                                       
avg_num_neighbors: null
use_sc: true                                                
compile_model: false                        

dataset: npz                                  
dataset_file_name: directory_to_dataset.npz       
key_mapping:
  z: atomic_numbers
  E: total_energy                                                
  F: forces                                    
  R: pos                                                      
npz_fixed_field_keys:                               
  - atomic_numbers
 
chemical_symbol_to_type:
  H: 0
  C: 1
  O: 2
  Cu: 3

# logging
wandb: true              
wandb_project: project_name                                               
wandb_resume: true                              
verbose: info                                       
log_batch_freq: 1                                      
log_epoch_freq: 1                                                        
save_checkpoint_freq: -1                                                           
save_ema_checkpoint_freq: -1                                                      

# training
n_train: 18000                                                     
n_val: 4400                                                          
learning_rate: 0.005                                                          
batch_size: 5                                                                      
max_epochs: 100000                                                                     
train_val_split: random                                                           
shuffle: false #because shuffle data beforehand                                          
metrics_key: validation_loss                                                       
use_ema: true                                                                      
ema_decay: 0.99                                                                   
ema_use_num_updates: true                                                         
report_init_validation: false

early_stopping_patiences:                                                       
  validation_loss: 50
early_stopping_delta:                                                             
  validation_loss: 0.01
early_stopping_cumulative_delta: false                                           
early_stopping_lower_bounds:                                                       
  LR: 1.0e-6
early_stopping_upper_bounds:                                                       
  wall: 1.0e+100

# loss function
loss_coeffs:                                                                       
  forces: 21904 #i.e. N^2 (=148^2)                                                                       
  total_energy:                                                                    
    - 1
    - MSELoss

metrics_components:
  - - forces                          
    - rmse                               
    - PerSpecies: True                  
      report_per_component: False      
  - - forces
    - mae
    - PerSpecies: True
      report_per_component: False
  - - total_energy
    - mae
    - PerAtom: True             
  - - total_energy
    - mae
    - PerAtom: False

# optimizer
optimizer_name: Adam                                                        
optimizer_amsgrad: true
optimizer_betas: !!python/tuple
  - 0.9
  - 0.999
optimizer_eps: 1.0e-08
optimizer_weight_decay: 0

max_gradient_norm: null

lr_scheduler_name: ReduceLROnPlateau
lr_scheduler_patience: 100
lr_scheduler_factor: 0.5

per_species_rescale_scales_trainable: false
per_species_rescale_shifts_trainable: false
per_species_rescale_shifts: dataset_per_atom_total_energy_mean

global_rescale_shift: null
global_rescale_scale: dataset_forces_rms
global_rescale_shift_trainable: false
global_rescale_scale_trainable: false

Early stopping was triggered after approximately 1350 epochs, giving the errors below (in eV). These seem acceptable; e_mae could perhaps be better, but that is a consequence of the loss function (weight ratio F:E = 148^2:1).

            0_f_rmse =  0.213254
            1_f_rmse =  0.098634
            2_f_rmse =  0.097583
            3_f_rmse =  0.046718
          all_f_rmse =  0.114047
             0_f_mae =  0.075106
             1_f_mae =  0.043834
             2_f_mae =  0.042400
             3_f_mae =  0.025226
           all_f_mae =  0.046642
             e/N_mae =  0.035423
               e_mae =  5.242538
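
As a quick consistency check on the numbers above, 148 atoms corresponds to 144 Cu atoms plus the four atoms of one formate (HCOO), and the two energy metrics differ only by that factor because every frame has the same atom count.

```python
# Atom count implied by the loss weight comment "N^2 (=148^2)":
# 144 Cu atoms plus one formate (HCOO, 4 atoms).
n_atoms = 144 + 4
force_weight = n_atoms ** 2
print(force_weight)  # 21904, matching loss_coeffs.forces in the config

# Reported errors, in eV.
e_mae_total = 5.242538
e_mae_per_atom = 0.035423

# With a fixed atom count the two metrics differ only by a factor of N.
print(round(e_mae_total / n_atoms, 6))
```

In other words, the reported e/N_mae is about 35 meV per atom, which is the same information as the 5.24 eV total-energy MAE.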

Unfortunately, in MD simulations the system again behaved in a non-physical way.
Below are some screenshots at the start, at 50 fs and at 150 fs, respectively:

We used the following MD example configuration in LAMMPS:

units real
newton off
read_data structure.data

pair_style nequip
pair_coeff * * model-deployed.pth C Cu H O
mass  1 12.0107
mass  2 63.546
mass  3 1.00794
mass  4 15.999

# Run MD
timestep 1.0
dump 1 all custom 10 traj_nvt.lammpstrj id type x y z ix iy iz
velocity all create 300.0 4928459 

# NVT at 300 K (fix nvt applies no pressure control)
fix fxnvt all nvt temp 300 300 100.0 tchain 4
fix fxlmom all momentum 10 linear 1 1 1

thermo 10
thermo_style custom step etotal ke temp pe ebond eangle edihed eimp evdwl ecoul elong press vol cella cellb cellc density

run 10000

write_data system_after_nvt.restart
write_data system_after_nvt.data

Do you have any insights into what might be going wrong? Have you tried (and succeeded in) performing MD simulations for the system in the "Heterogeneous catalysis of formate dehydrogenation" section of your paper?

Thanks in advance,

Jim Boelrijk and Bart de Mooij
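
One thing that may be worth checking in the LAMMPS input above is the unit system. LAMMPS `units real` uses kcal/mol for energy and femtoseconds for time, whereas `units metal` uses eV and picoseconds, and the errors in this post are quoted in eV. Whether this matters here depends on how the model was trained and deployed, which the post does not state. The sketch below therefore only shows the size of the potential mismatch, using a rounded conversion factor.

```python
EV_TO_KCAL_PER_MOL = 23.0605  # 1 eV expressed in kcal/mol (rounded)
FS_PER_PS = 1000.0


def timestep_in_metal(timestep_fs: float) -> float:
    """Convert a timestep in fs (units real) to ps (units metal)."""
    return timestep_fs / FS_PER_PS


# A force of 1 eV/Angstrom expressed in kcal/mol/Angstrom.
force_ev_per_angstrom = 1.0
print(force_ev_per_angstrom * EV_TO_KCAL_PER_MOL)

# The 1.0 fs timestep of the input, expressed in metal units.
print(timestep_in_metal(1.0))
```

If a model that outputs eV were run under `units real`, its forces would be interpreted as roughly 23 times weaker than intended. Separately, the MD timestep of 1.0 fs is twice the 0.5 fs used to generate the AIMD reference data.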

Inconsistent runtime error when using a nequip model to perform Langevin dynamics in ASE 🐛 [BUG]

Describe the bug
When running multiple dynamics runs with the nequip calculator for ASE, some trajectories crash with the error below:

Traceback (most recent call last):
  File "/home/nhattrup/Fluxional_MD/scripts/md.py", line 71, in <module>
    nvt_dyn.run(steps=args.num_steps)
  File "/home/nhattrup/.conda/envs/nequip/lib/python3.9/site-packages/ase/md/md.py", line 137, in run
    return Dynamics.run(self)
  File "/home/nhattrup/.conda/envs/nequip/lib/python3.9/site-packages/ase/optimize/optimize.py", line 156, in run
    for converged in Dynamics.irun(self):
  File "/home/nhattrup/.conda/envs/nequip/lib/python3.9/site-packages/ase/optimize/optimize.py", line 135, in irun
    self.step()
  File "/home/nhattrup/.conda/envs/nequip/lib/python3.9/site-packages/ase/md/langevin.py", line 171, in step
    forces = atoms.get_forces(md=True)
  File "/home/nhattrup/.conda/envs/nequip/lib/python3.9/site-packages/ase/atoms.py", line 788, in get_forces
    forces = self._calc.get_forces(self)
  File "/home/nhattrup/.conda/envs/nequip/lib/python3.9/site-packages/ase/calculators/abc.py", line 23, in get_forces
    return self.get_property('forces', atoms)
  File "/home/nhattrup/.conda/envs/nequip/lib/python3.9/site-packages/ase/calculators/calculator.py", line 737, in get_property
    self.calculate(atoms, [name], system_changes)
  File "/home/nhattrup/.conda/envs/nequip/lib/python3.9/site-packages/nequip/ase/nequip_calculator.py", line 108, in calculate
    data = AtomicData.from_ase(atoms=atoms, r_max=self.r_max)
  File "/home/nhattrup/.conda/envs/nequip/lib/python3.9/site-packages/nequip/data/AtomicData.py", line 427, in from_ase
    return cls.from_points(
  File "/home/nhattrup/.conda/envs/nequip/lib/python3.9/site-packages/nequip/data/AtomicData.py", line 308, in from_points
    edge_index, edge_cell_shift, cell = neighbor_list_and_relative_vec(
  File "/home/nhattrup/.conda/envs/nequip/lib/python3.9/site-packages/nequip/data/AtomicData.py", line 744, in neighbor_list_and_relative_vec
    raise ValueError(
ValueError: After eliminating self edges, no edges remain in this system.

To Reproduce
I am using the most recent nequip version with ASE; if needed, I am happy to supply the deployed nequip model. Below is the code I use to set up the Langevin dynamics:

for i in range(args.samples):
    nvt_dyn = Langevin(
        atoms=atoms,
        temperature_K=args.temperature,
        timestep=args.dt * units.fs,
        friction=0.02,
    )
    traj_file = args.dir + '/' + 'Trajectory_' + str(i) + '.traj'
    print(i, traj_file)
    MaxwellBoltzmannDistribution(atoms=atoms, temp=args.temperature * units.kB)
    ZeroRotation(atoms)  # set the total angular momentum to zero
    Stationary(atoms)  # set the center-of-mass momentum to zero
    traj = ASETrajectory(traj_file, 'w', atoms)
    traj.write(atoms)
    nvt_dyn.attach(traj.write, interval=args.interval)
    nvt_dyn.run(steps=args.num_steps)
    traj.close()
    # reset atom positions to the initial sampling geometry
    atoms.set_positions(init_xyz.copy())

Expected behavior
The dynamics should run without issues and print the trajectory number and the path where the data is written, i.e.:

1 ../nequip/.../ASE/Trajectory_1.traj
2 ../nequip/.../ASE/Trajectory_2.traj

Environment (please complete the following information):

OS: Ubuntu
python version 3.9.12
python environment (commands are given for python interpreter):
- nequip version 0.5.4
- e3nn version 0.4.4
- pytorch version 1.12.0+cu116
(if relevant) GPU support with CUDA
- cuda Version according to nvcc Build cuda_11.6.r11.6/compiler.30978841_0
- cuda version according to PyTorch 11.6

Additional Context
The trajectories that do not fail look perfectly reasonable.
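
The error message says that, apart from self edges, no pair of atoms lies within the cutoff. In an MD run that could happen if atoms drift further apart than r_max, for example after a fragment dissociates. This reading is an assumption based on the message, not on the NequIP source. A brute-force check for a non-periodic system could look like the sketch below; the function name is hypothetical, and whether the real cutoff test is strict (`<`) or inclusive (`<=`) is not known from this post.

```python
import math
from typing import Sequence

Position = Sequence[float]


def has_neighbor_within(positions: Sequence[Position], r_max: float) -> bool:
    """True if at least one pair of distinct atoms is closer than r_max (no periodic images)."""
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            if math.dist(positions[i], positions[j]) < r_max:
                return True
    return False


# Two atoms 1 Angstrom apart are neighbors for r_max = 5; 10 Angstrom apart they are not.
print(has_neighbor_within([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], r_max=5.0))
print(has_neighbor_within([(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)], r_max=5.0))
```

Calling this on `atoms.get_positions()` for the last frame written before a crash would show whether the failing trajectories had indeed flown apart.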

Multi-GPU support exists? ❓ [QUESTION]

We are interested in training nequip potentials on large datasets of several million structures.
Consequently, we wanted to know whether multi-GPU support exists, or whether anyone knows if the networks can be integrated into PyTorch Lightning.
Best regards and thank you very much,
Jonathan
PS: this might be related to #126

🌟 [FEATURE] Support for newer PyTorch

Is your feature request related to a problem? Please describe.
Is there a plan to support newer versions of PyTorch in the near or far future? Currently installing PyTorch from Conda restricts the python version since there is no PyTorch=1.11 for Python>=3.10 as far as I can tell.

🌟 [FEATURE] `utils.config` as dataclass?

Is your feature request related to a problem? Please describe.
Whilst using nequip myself I was going through the code to better understand the parameter handling.
I'm not using wandb but DVC with ZnTrack and therefore, I'm currently using two nested subprocess calls (The first is DVC, the second is in ZnTrack). I was looking for a way to train the model directly (calling train.main basically).

Whilst looking for that I was wondering if the Config object

nequip/nequip/utils/config.py

Line 45 in eb6f9bc

class Config(object):

could be replaced by a Python dataclass?
I could see multiple benefits here:

The default values could be stored in the dataclass instead of

nequip/nequip/scripts/train.py

Lines 26 to 47 in eb6f9bc

    
default_config = dict(
    root="./",
    run_name="NequIP",
    wandb=False,
    wandb_project="NequIP",
    model_builders=[
        "SimpleIrrepsConfig",
        "EnergyModel",
        "PerSpeciesRescale",
        "ForceOutput",
        "RescaleEnergyEtc",
    ],
    dataset_statistics_stride=1,
    default_dtype="float32",
    allow_tf32=False,  # TODO: until we understand equivar issues
    verbose="INFO",
    model_debug_mode=False,
    equivariance_test=False,
    grad_anomaly_mode=False,
    append=False,
    _jit_bailout_depth=2,  # avoid 20 iters of pain, see https://github.com/pytorch/pytorch/issues/52286
)

The attributes could have docstrings that would help build the documentation. I like the way the config.yaml files are documented, and that helped me a lot, but I think a documented dataclass in addition could also be helpful.
Attributes would get autocompletion.

Describe the solution you'd like
I could have a look at it if there is some general interest.

Additional context

One could use the __doc__ of the dataclass to automatically generate the yaml file including the documentation to only maintain it in one place. This would be a proof-of-concept example:

import dataclasses
import yaml
import re


@dataclasses.dataclass
class Config:
    """
    Attributes:
        seed: model seed
        dataset_seed: data set seed
    """

    seed: int = 123456
    dataset_seed: int = 31415

    @property
    def to_yaml(self) -> str:
        """Convert the dataclass to a yaml string including documentation"""
        doc_dict = {}
        data_dict = {}

        for field in dataclasses.fields(self):
            data_dict[field.name] = getattr(self, field.name)
            doc_dict[field.name] = re.search( # not the final version
                rf"(?<={field.name}:).*", self.__doc__
            ).group(0)

        yaml_string = ""
        for line in yaml.dump(data_dict, indent=4).splitlines():
            for key in doc_dict:
                if line.startswith(key):
                    yaml_string += f"{line}    # {doc_dict[key]} \n"
                    break
            else:
                yaml_string += f"{line}\n"

        return yaml_string

🐛 [BUG] Compatibility with torch1.9, geometric1.7, e3nn 0.3.2

Describe the bug
A function compiled with torch.jit.script does not accept a plain dict where typing.Dict is expected.

To Reproduce
The unit test tests/data/test_AtomicData.py::test_non_periodic_edge fails with PyTorch 1.9 and torch_geometric 1.7.

Environment (please complete the following information):

OS: Fedora
python version 3.8.8
python environment:
- nequip version 0.3.2
- e3nn version 0.3.2
- pytorch version 1.9.0+cu102

Pytest error message:

========================================== FAILURES ==========================================
______________________________ test_non_periodic_edge[float32] _______________________________

CH3CHO = (Atoms(symbols='OCHCH3', pbc=False), AtomicData(edge_index=[2, 18], pos=[7, 3], num_nodes=7, atomic_numbers=[7], cell=[3, 3], edge_cell_shift=[18, 3], pbc=[3]))

def test_non_periodic_edge(CH3CHO):
    atoms, data = CH3CHO
    # check edges
    for edge in range(data.num_edges):
        real_displacement = (
            atoms.positions[data.edge_index[1, edge]]
            - atoms.positions[data.edge_index[0, edge]]
        )

      assert torch.allclose(

            data.get_edge_vectors()[edge],
            torch.as_tensor(real_displacement, dtype=torch.get_default_dtype()),
        )

tests/data/test_AtomicData.py:30:

data = AtomicData(edge_index=[2, 18], pos=[7, 3], num_nodes=7, atomic_numbers=[7], cell=[3, 3], edge_cell_shift=[18, 3], pbc=[3])

def get_edge_vectors(data: Data) -> torch.Tensor:

  data = AtomicDataDict.with_edge_vectors(AtomicData.to_AtomicDataDict(data))

E RuntimeError: with_edge_vectors() Expected a value of type 'Dict[str, Tensor]' for argument 'data' but instead found type 'dict'.
E Position: 0
E Value: {'atomic_numbers': tensor([8, 6, 1, 6, 1, 1, 1]), 'num_nodes': 7, 'pos': tensor([[ 1.2181, 0.3612, 0.0000],
E [ 0.0000, 0.4641, 0.0000],
E [-0.4772, 1.4653, 0.0000],
E [-0.9481, -0.7001, 0.0000],
E [-0.3859, -1.6342, 0.0000],
E [-1.5963, -0.6525, 0.8809],
E [-1.5963, -0.6525, -0.8809]]), 'edge_index': tensor([[0, 1, 1, 1, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6],
E [1, 3, 0, 2, 1, 5, 6, 4, 1, 3, 5, 6, 6, 4, 3, 5, 3, 4]]), 'edge_cell_shift': tensor([[0., 0., 0.],
E [0., 0., 0.],
E [0., 0., 0.],
E [0., 0., 0.],
E [0., 0., 0.],
E [0., 0., 0.],
E [0., 0., 0.],
E [0., 0., 0.],
E [0., 0., 0.],
E [0., 0., 0.],
E [0., 0., 0.],
E [0., 0., 0.],
E [0., 0., 0.],
E [0., 0., 0.],
E [0., 0., 0.],
E [0., 0., 0.],
E [0., 0., 0.],
E [0., 0., 0.]]), 'pbc': tensor([False, False, False]), 'cell': tensor([[0., 0., 0.],
E [0., 0., 0.],
E [0., 0., 0.]])}
E Declaration: with_edge_vectors(Dict(str, Tensor) data, bool with_lengths=True) -> (Dict(str, Tensor))
E Cast error details: Unable to cast Python instance to C++ type (compile in debug mode for details)

nequip/data/AtomicData.py:253: RuntimeError

TypeError when reading ASE dataset

Describe the bug
Using an ASE dataset for configs/minimal.yaml instead of npz results in the following:

Loaded data: Batch(atomic_numbers=[21000], batch=[21000], cell=[1000, 3, 3], edge_cell_shift=[220186, 3], edge_index=[2, 220186], pbc=[1000, 3], pos=[21000, 3], ptr=[1001])
Successfully loaded the data set of type ASEDataset(1000)...
Traceback (most recent call last):
  File "/home/kkly2/anaconda3/envs/nequip/bin/nequip-train", line 8, in <module>
    sys.exit(main())
  File "/home/kkly2/anaconda3/envs/nequip/lib/python3.8/site-packages/nequip/scripts/train.py", line 40, in main
    fresh_start(parse_command_line(args))
  File "/home/kkly2/anaconda3/envs/nequip/lib/python3.8/site-packages/nequip/scripts/train.py", line 148, in fresh_start
    stats = trainer.dataset_train.statistics(
  File "/home/kkly2/anaconda3/envs/nequip/lib/python3.8/site-packages/nequip/data/dataset.py", line 328, in statistics
    elif len(arr) == self.data.num_graphs:
TypeError: object of type 'NoneType' has no len()

To Reproduce
I just took configs/minimal.yaml and modified just the data part:

$ diff minimal.yaml ~/nequip/configs/minimal.yaml
15,16c15,16
< dataset: ase
< dataset_file_name: ../MD17/aspirin_ccsd-train.xyz
---
> dataset: aspirin
> dataset_file_name: benchmark_data/aspirin_ccsd-train.npz

where aspirin_ccsd-train.xyz is from here (I assume the .npz is the same dataset?).

Expected behavior
As is, configs/minimal.yaml works for me without any errors, so I expected the same here when using ASE.

Environment (please complete the following information):

OS: Fedora 25
python version 3.8.3
python environment (commands are given for python interpreter):
- nequip version 0.3.3
- e3nn version 0.3.5
- pytorch version 1.9.1

metrics files contains column names with leading blank space

The *.csv files contain a blank space before each column name.
For example, reading them with pandas requires including this blank in the key, which can be confusing for people not using wandb.

df = pd.read_csv("metrics_epoch.csv")
df["training_loss"].plot() # -> raises KeyError
df[" training_loss"].plot() # works

Would it be possible to change the header formatting found in the following lines to not contain the leading blank space?

nequip/nequip/train/trainer.py

Line 1071 in eb6f9bc

header += f", {category}_{key}"

nequip/nequip/train/trainer.py

Line 1079 in eb6f9bc

header += f", {category}_{key}"
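
Until the header format changes, one possible workaround on the reading side is the `skipinitialspace` option of `pandas.read_csv`, which drops the blanks that follow each delimiter. The sketch below uses an in-memory CSV whose column names and values are made up to mimic the header layout described above; they are assumptions, not real nequip output.

```python
import io

import pandas as pd

# Made-up stand-in for metrics_epoch.csv: fields are separated by ", ".
csv_text = "epoch, training_loss, validation_loss\n1, 0.50, 0.60\n2, 0.40, 0.55\n"

# skipinitialspace=True drops whitespace that follows each delimiter,
# so the column names come out without the leading blank.
df = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)

print(list(df.columns))
```

Stripping the names after loading, with `df.columns = df.columns.str.strip()`, is an alternative when the file has already been read.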

Reduce LR on plateau but not increase

The ReduceLROnPlateau option actually looks for the loss to increase, not plateau. As long as it doesn't increase, the learning rate doesn't change.

In training my model I never see the loss increase. It just keeps decreasing by tinier and tinier amounts. Could we add a margin to the test, so for example I could tell it to reduce the learning rate any time the loss decreases by less than 2%?
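
The requested margin can be expressed as a relative threshold: an epoch counts as an improvement only if the loss drops by more than a given fraction of the best value seen so far. PyTorch's `torch.optim.lr_scheduler.ReduceLROnPlateau` has `threshold` and `threshold_mode` arguments for this kind of test; whether nequip forwards them from the config is not confirmed here. The plain-Python sketch below only illustrates the logic, and the class name `PlateauDetector` and its parameters are hypothetical.

```python
class PlateauDetector:
    """Toy plateau test with a relative margin (hypothetical helper)."""

    def __init__(self, threshold=0.02, patience=2, factor=0.5):
        self.threshold = threshold  # required relative improvement, e.g. 2%
        self.patience = patience    # epochs without improvement that are tolerated
        self.factor = factor        # multiplier applied to the learning rate
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, loss, lr):
        """Return the learning rate to use after observing `loss`."""
        if loss < self.best * (1.0 - self.threshold):
            # Improved by more than the margin: remember it and reset the counter.
            self.best = loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        if self.bad_epochs > self.patience:
            lr *= self.factor
            self.bad_epochs = 0
        return lr


detector = PlateauDetector(threshold=0.02, patience=2, factor=0.5)
lr = 0.01
# The loss keeps decreasing, but by less than 2% per epoch after the first.
for loss in [1.000, 0.995, 0.991, 0.988, 0.986]:
    lr = detector.step(loss, lr)
print(lr)
```

With this test, a loss that keeps shrinking by tiny amounts is treated as a plateau, which is the behavior asked for above.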

🌟 [FEATURE] Stress tensor

Is your feature request related to a problem? Please describe.
To run NPT simulations, the stress of the system is needed, so I wonder if it is possible to add an option to calculate it. Calculating the stress requires the derivative of the energy with respect to the cell. In evaluation mode, however, the model does not store the gradients, so this value cannot be computed. In training mode everything is rescaled, and I don't think running inference in training mode is what it is meant for. Adding a cell and stress key in GradientOutput causes an error that there is no key 'cell' in irreps_in[wrt] at line 62 of _grad_output.py. Since I figure that the cell should not be included in irreps_in, this option seems bad too.

Describe the solution you'd like
A method on the final_model (RescaleOutput) or on GradientOutput that calculates the stress when a stress parameter in the config file is True.

Additional context
In schnetpack this is implemented in https://github.com/atomistic-machine-learning/schnetpack/blob/master/src/schnetpack/atomistic/output_modules.py and https://github.com/atomistic-machine-learning/schnetpack/blob//src/schnetpack/atomistic/model.py
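
As an illustration of the quantity requested here, the stress can be written as the derivative of the energy with respect to a homogeneous strain applied to both the positions and the cell, divided by the volume. The sketch below estimates it by central finite differences with NumPy. It is not how nequip or schnetpack compute it (an implementation inside the model would more likely use automatic differentiation); `toy_energy` is a hypothetical stand-in for a model, and sign conventions for stress differ between codes.

```python
import numpy as np


def toy_energy(positions, cell, k=1.0, r0=1.0):
    """Hypothetical stand-in for a model: harmonic pair energy, no periodic images."""
    energy = 0.0
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            r = np.linalg.norm(positions[j] - positions[i])
            energy += 0.5 * k * (r - r0) ** 2
    return energy


def finite_difference_stress(energy_fn, positions, cell, eps=1e-6):
    """Estimate stress[a, b] = (1 / V) * dE / d(strain[a, b]) by central differences."""
    volume = abs(np.linalg.det(cell))
    stress = np.zeros((3, 3))
    for a in range(3):
        for b in range(3):
            # Symmetric strain direction; scaling it by eps gives a small strain.
            direction = np.zeros((3, 3))
            direction[a, b] += 0.5
            direction[b, a] += 0.5
            plus = np.eye(3) + eps * direction
            minus = np.eye(3) - eps * direction
            # Positions and cell vectors are deformed together.
            e_plus = energy_fn(positions @ plus, cell @ plus)
            e_minus = energy_fn(positions @ minus, cell @ minus)
            stress[a, b] = (e_plus - e_minus) / (2.0 * eps * volume)
    return stress


# Two atoms stretched along x beyond the equilibrium distance r0 = 1.
positions = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0]])
cell = 10.0 * np.eye(3)
stress = finite_difference_stress(toy_energy, positions, cell)
print(stress[0, 0])
```

In this convention a positive xx component means the energy rises when the system is stretched along x; only the xx component is nonzero for the two-atom example.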

🌟 [FEATURE] Add conda installation support

Is your feature request related to a problem? Please describe.
The only way right now seems to be to download the repo and install nequip locally using pip. However, mixing conda and pip in a conda virtual environment risks messing up packages installed by those two package managers.

Describe the solution you'd like
It would be awesome if you guys can add support for conda installation.

Describe alternatives you've considered
Alternatively, hosting on pypi also resolves the issue since I can simply use the conda skeleton command to build and install using conda.

Additional context

🐛 [BUG] Crash when using large dataset

Describe the bug
I am trying to train a model on a large dataset with over 700,000 conformations for a diverse set of molecules. I created a dataset in extxyz format as described at #89. The file is about 1.5 GB. When I run nequip-train, it displays the message, "Processing...", shows no further output for about 20 minutes, and then exits with the message, "Killed".

I also tried using a subset of only about the first 200 conformations from the file, and that worked. I suspect the problem is caused by running out of memory or some other resource. Is there any supported way of handling large datasets like this?

To Reproduce
The dataset is much too large to attach, but if it would be helpful I can find a different way of sending it to you.

Environment (please complete the following information):

OS: [e.g. Ubuntu, Windows] Ubuntu 18.04
python version (python --version) 3.9
python environment (commands are given for python interpreter):
- nequip version (import nequip; nequip.__version__) 0.5.4
- e3nn version (import e3nn; e3nn.__version__) 0.4.4
- pytorch version (import torch; torch.__version__) 1.10.0
(if relevant) GPU support with CUDA
- cuda Version according to nvcc (nvcc --version)
- cuda version according to PyTorch (import torch; torch.version.cuda)

🐛 [BUG] Segfault with float64 models

Describe the bug
NequIP models trained in float64 precision always segfault at prediction time, in both ASE and LAMMPS, on GPU and CPU.

To Reproduce

calc = NequIP(...)
atoms = bulk(...)
calc.get_potential_energy(atoms)

Terminal output

/home/mphuthi_andrew_cmu_edu/miniconda3/envs/nequip/lib/python3.9/site-packages/nequip/utils/_global_options.py:58: UserWarning: Setting the GLOBAL value for jit fusion strategy to `[('DYNAMIC', 3)]` which is different than the previous value of `[('STATIC', 2), ('DYNAMIC', 10)]`
  warnings.warn(
/home/mphuthi_andrew_cmu_edu/miniconda3/envs/nequip/lib/python3.9/site-packages/nequip/ase/nequip_calculator.py:73: UserWarning: Trying to use chemical symbols as NequIP type names; this may not be correct for your model! To avoid this warning, please provide `species_to_type_name` explicitly.
  warnings.warn(
Segmentation fault

Expected behavior
No segfault

Environment (please complete the following information):

OS: CentOS
python version: 3.9.0
python environment (commands are given for python interpreter):
- nequip version: 0.5.6
- e3nn version: 0.5.0
- pytorch version: 1.11.0
(if relevant) GPU support with CUDA
- cuda Version according to nvcc: 11.3
- cuda version according to PyTorch: 11.3

Additional context
I have never been able to get float64 models to work, even with other build recipes in the past or with Allegro, although they train fine.

🌟 [FEATURE] Missing features for OpenMM support (neighborlist, etc.)

Is your feature request related to a problem? Please describe.
I would like to run long MD simulations using OpenMM. With https://github.com/openmm/openmm-torch it is possible to add forces from a TorchScript model. However, the input has to be the positions and box vectors.
To this end, the neighbor list calculation should be included in the TorchScript module.

Describe the solution you'd like
A TorchScript model that takes the positions and box vectors as input and outputs the energy. To this end, the neighbor list would have to be computed on the GPU, as far as I know.
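
To make the missing piece concrete, the sketch below builds a neighbor list from positions and box lengths alone. It is a brute-force O(N^2) NumPy version that runs on the CPU and assumes an orthorhombic box with a cutoff of at most half the shortest box length; a version embedded in a TorchScript module would need the same logic expressed in torch operations. The function name `brute_force_neighbor_list` is hypothetical.

```python
import numpy as np


def brute_force_neighbor_list(positions, box_lengths, r_max):
    """Return (edge_index, edge_vectors) for all pairs closer than r_max.

    Uses the minimum image convention for an orthorhombic periodic box.
    """
    # delta[i, j] is the vector from atom i to atom j.
    delta = positions[None, :, :] - positions[:, None, :]
    # Wrap each component to the nearest periodic image.
    delta -= box_lengths * np.round(delta / box_lengths)
    dist = np.linalg.norm(delta, axis=-1)
    # Keep pairs within the cutoff, excluding self-pairs.
    mask = (dist < r_max) & ~np.eye(len(positions), dtype=bool)
    i, j = np.nonzero(mask)
    return np.stack([i, j]), delta[i, j]


positions = np.array([[0.5, 0.0, 0.0], [9.5, 0.0, 0.0], [5.0, 5.0, 5.0]])
box_lengths = np.array([10.0, 10.0, 10.0])
edge_index, edge_vectors = brute_force_neighbor_list(positions, box_lengths, r_max=2.0)
print(edge_index)
print(edge_vectors)
```

Atoms 0 and 1 are neighbors only through the periodic boundary, which is why the box vectors have to be part of the model input.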

❓ [QUESTION] Train models in single precision, but evaluate them in double precision

Is it possible to train models in single precision, but deploy them in double precision? I'm trying to use my single-precision-trained models to compute some finite differences, and this is typically only possible when the output of the models uses double precision.
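
The sensitivity of finite differences to precision can be seen on a toy function: the round-off error in f(x + h) - f(x - h) is divided by a small h, so single precision loses most of its significant digits. The sketch below uses NumPy with sin as a stand-in for a model output. It does not involve nequip, and `central_difference` is a hypothetical helper.

```python
import math

import numpy as np


def central_difference(f, x, h, dtype):
    """Central finite difference with every operand cast to `dtype`."""
    x = dtype(x)
    h = dtype(h)
    return (f(x + h) - f(x - h)) / (dtype(2) * h)


exact = math.cos(1.0)  # derivative of sin at x = 1
err32 = abs(float(central_difference(np.sin, 1.0, 1e-5, np.float32)) - exact)
err64 = abs(float(central_difference(np.sin, 1.0, 1e-5, np.float64)) - exact)

print("float32 error:", err32)
print("float64 error:", err64)
```

The same step size gives an error that is many orders of magnitude larger in float32 than in float64, which is why double-precision outputs are usually wanted for this kind of calculation.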

🐛 [BUG] NotADirectoryError when attempting to run git

Describe the bug

To Reproduce

After compiling nequip 0.5.5 from source, when trying to use it we see this error:

$ nequip-train config.yaml
Traceback (most recent call last):
  File "/ccc/work/cont003/gen7069/couderfx/nequip/bin/nequip-train", line 33, in <module>
    sys.exit(load_entry_point('nequip==0.5.5', 'console_scripts', 'nequip-train')())
  File "/ccc/work/cont003/gen7069/couderfx/nequip/lib/python3.8/site-packages/nequip-0.5.5-py3.8.egg/nequip/scripts/train.py", line 72, in main
  File "/ccc/work/cont003/gen7069/couderfx/nequip/lib/python3.8/site-packages/nequip-0.5.5-py3.8.egg/nequip/scripts/train.py", line 120, in fresh_start
  File "/ccc/work/cont003/gen7069/couderfx/nequip/lib/python3.8/site-packages/nequip-0.5.5-py3.8.egg/nequip/utils/versions.py", line 48, in check_code_version
  File "/ccc/work/cont003/gen7069/couderfx/nequip/lib/python3.8/site-packages/nequip-0.5.5-py3.8.egg/nequip/utils/versions.py", line 42, in get_current_code_versions
  File "/ccc/work/cont003/gen7069/couderfx/nequip/lib/python3.8/site-packages/nequip-0.5.5-py3.8.egg/nequip/utils/versions.py", line 42, in <dictcomp>
  File "/ccc/work/cont003/gen7069/couderfx/nequip/lib/python3.8/site-packages/nequip-0.5.5-py3.8.egg/nequip/utils/git.py", line 13, in get_commit
  File "/ccc/products2/python3-3.8.10/Rhel_8__x86_64/gcc--8.3.0__openmpi--4.0.1/cuda/lib/python3.8/subprocess.py", line 493, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/ccc/products2/python3-3.8.10/Rhel_8__x86_64/gcc--8.3.0__openmpi--4.0.1/cuda/lib/python3.8/subprocess.py", line 858, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/ccc/products2/python3-3.8.10/Rhel_8__x86_64/gcc--8.3.0__openmpi--4.0.1/cuda/lib/python3.8/subprocess.py", line 1704, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
NotADirectoryError: [Errno 20] Not a directory: '/ccc/work/cont003/gen7069/couderfx/nequip/lib/python3.8/site-packages/nequip-0.5.5-py3.8.egg/nequip/..'

This does not appear to depend on the details of the config.yaml file, any example from nequip itself will reproduce the issue.

The error is that get_commit calls git with a cwd set by appending /.. to $VENV/lib/python3.8/site-packages/nequip-0.5.5-py3.8.egg/nequip (where VENV is the virtualenv where nequip is installed). But that fails, because nequip-0.5.5-py3.8.egg is not actually a directory, it's a ZIP file, so the OS does not recognise $VENV/lib/python3.8/site-packages/nequip-0.5.5-py3.8.egg/nequip/.. as a valid directory (and rightly so).

Environment (please complete the following information):

OS: CentOS Linux release 8.4.2105
Python 3.8.10
python environment (commands are given for python interpreter):
- nequip version: 0.5.5
- e3nn version: 0.5.0
- pytorch version: 1.10.2

Additional context

nequip was installed in the virtualenv from source by running: python setup.py install
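
One way to make the version check tolerant of zipped eggs is to verify that the working directory exists before invoking git, and to treat any failure as "commit unknown". The sketch below is an assumption about how such a guard could look, not nequip's actual code; `get_commit_or_none` is a hypothetical name.

```python
import subprocess
from pathlib import Path
from typing import Optional


def get_commit_or_none(module_file: str) -> Optional[str]:
    """Return the git commit of the checkout containing `module_file`, or None."""
    # Directory above the package, where a .git folder would live in a source checkout.
    cwd = Path(module_file).parent.parent
    if not cwd.is_dir():
        # E.g. the package sits inside a zipped .egg, which is a file, not a directory.
        return None
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=cwd,
            capture_output=True,
            text=True,
            timeout=10,
        )
    except (OSError, subprocess.SubprocessError):
        # git is not installed, or the call failed or timed out.
        return None
    if result.returncode != 0:
        return None
    return result.stdout.strip() or None


# A path shaped like the one in the traceback; it does not exist on disk.
print(get_commit_or_none("/no/such/dir/nequip-0.5.5-py3.8.egg/nequip/__init__.py"))
```

Installing with `pip install .` instead of `python setup.py install` generally avoids zipped eggs in the first place, which may be a simpler workaround.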

❓ [QUESTION] How to perform transfer training with nequip?

I have trained a nequip model M on a large dataset A. What can I do to make M also applicable to a small but different dataset B (other than training a new model on A+B)?

wrong ValueError text

nequip/nequip/train/trainer.py

Line 1180 in 41d6b2d

raise ValueError("Not enough data in dataset for requested n_train")

This line should have n_val instead of n_train in the error text

🐛 [BUG] Error during training with training set of different cell size

Environment I used is

OS : CentOS
python version : 3.8
nequip version : 0.5.4
e3nn version : 0.4.4
pytorch version : 1.10.1
cuda version : 11.2

During training, I tried to use a training set with multiple cell sizes (for example, some structures with 120 atoms and some with 60 atoms). Training then ended with the errors below.

instantiate NpzDataset
   optional_args :                                         key_mapping
   optional_args :                                npz_fixed_field_keys
   optional_args :                                                root
   optional_args :                                  extra_fixed_fields <-                         dataset_extra_fixed_fields
   optional_args :                                           file_name <-                                  dataset_file_name
...NpzDataset_param = dict(
...   optional_args = {'key_mapping': {'z': 'atomic_numbers', 'E': 'total_energy', 'F': 'forces', 'R': 'pos'}, 'include_keys': [], 'npz_fixed_field_keys': ['atomic_numbers'], 'file_name': './train_set.npz', 'url': None, 'force_fixed_keys': [], 'extra_fixed_fields': {'r_max': 4.0}, 'include_frames': None, 'root': 'results/GeSe2'},
...   positional_args = {'type_mapper': <nequip.data.transforms.TypeMapper object at 0x2b9f505d7490>})
Traceback (most recent call last):
  File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/utils/auto_init.py", line 232, in instantiate
    instance = builder(**positional_args, **final_optional_args)
  File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/data/dataset.py", line 681, in __init__
    super().__init__(
  File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/data/dataset.py", line 123, in __init__
    super().__init__(root=root, transform=type_mapper)
  File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/utils/torch_geometric/dataset.py", line 90, in __init__
    self._process()
  File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/utils/torch_geometric/dataset.py", line 175, in _process
    self.process()
  File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/data/dataset.py", line 269, in process
    data_list = [
  File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/data/dataset.py", line 270, in <listcomp>
    constructor(**{**{f: v[i] for f, v in fields.items()}, **fixed_fields})
  File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/data/AtomicData.py", line 326, in from_points
    return cls(edge_index=edge_index, pos=torch.as_tensor(pos), **kwargs)
  File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/data/AtomicData.py", line 221, in __init__
    _process_dict(kwargs)
  File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/data/AtomicData.py", line 163, in _process_dict
    raise ValueError(
ValueError: atomic_numbers is a node field but has the wrong dimension torch.Size([72, 1])

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/gshs12051/anaconda3/envs/pytorch/bin/nequip-train", line 8, in <module>
    sys.exit(main())
  File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/scripts/train.py", line 74, in main
    trainer = fresh_start(config)
  File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/scripts/train.py", line 177, in fresh_start
    dataset = dataset_from_config(config, prefix="dataset")
  File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/data/_build.py", line 78, in dataset_from_config
    instance, _ = instantiate(
  File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/utils/auto_init.py", line 234, in instantiate
    raise RuntimeError(
RuntimeError: Failed to build object with prefix `dataset` using builder `NpzDataset`

RuntimeError when using nequip-evaluate

I'm trying to use nequip-evaluate with a different dataset read through a yaml file and with a deployed potential using the command:

nequip-evaluate --model deployed_it9.pth --dataset-config predict.yaml --output bulk_nn.xyz --batch-size 1

The command runs for a few configurations of data and then gives an error:

Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/py/bin/nequip-evaluate", line 8, in <module>
    sys.exit(main())
  File "/home/ubuntu/anaconda3/envs/py/lib/python3.9/site-packages/nequip/scripts/evaluate.py", line 372, in main
    out = model(AtomicData.to_AtomicDataDict(batch))
  File "/home/ubuntu/anaconda3/envs/py/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: Unsupported value kind: Tensor

🌟 [FEATURE] supporting multiple dataset with vasp OUTCAR

Is your feature request related to a problem? Please describe.
I'm trying to train on the VASP output of two different systems (same solvent, different solutes).

I combined the two OUTCAR files with the cat command, tried to train on the data, and got the following error.

Traceback (most recent call last):
File "/data/home/reddy/anaconda3/lib/python3.8/site-packages/nequip/utils/auto_init.py", line 232, in instantiate
instance = builder(**positional_args, **final_optional_args)
File "/data/home/reddy/anaconda3/lib/python3.8/site-packages/nequip/data/dataset.py", line 796, in __init__
super().__init__(
File "/data/home/reddy/anaconda3/lib/python3.8/site-packages/nequip/data/dataset.py", line 123, in __init__
super().__init__(root=root, transform=type_mapper)
File "/data/home/reddy/anaconda3/lib/python3.8/site-packages/nequip/utils/torch_geometric/dataset.py", line 90, in __init__
self._process()
File "/data/home/reddy/anaconda3/lib/python3.8/site-packages/nequip/utils/torch_geometric/dataset.py", line 175, in _process
self.process()
File "/data/home/reddy/anaconda3/lib/python3.8/site-packages/nequip/data/dataset.py", line 206, in process
data = self.get_data()
File "/data/home/reddy/anaconda3/lib/python3.8/site-packages/nequip/data/dataset.py", line 857, in get_data
atoms_list = self.get_atoms()
File "/data/home/reddy/anaconda3/lib/python3.8/site-packages/nequip/data/dataset.py", line 852, in get_atoms
return aseread(self.raw_dir + "/" + self.raw_file_names[0], **self.ase_args)
File "/data/home/reddy/anaconda3/lib/python3.8/site-packages/ase/io/formats.py", line 733, in read
return list(_iread(filename, index, format, io, parallel=parallel,
File "/data/home/reddy/anaconda3/lib/python3.8/site-packages/ase/parallel.py", line 275, in new_generator
for result in generator(*args, **kwargs):
File "/data/home/reddy/anaconda3/lib/python3.8/site-packages/ase/io/formats.py", line 803, in _iread
for dct in io.read(fd, *args, **kwargs):
File "/data/home/reddy/anaconda3/lib/python3.8/site-packages/ase/io/formats.py", line 559, in wrap_read_function
for atoms in read(filename, index, **kwargs):
File "/data/home/reddy/anaconda3/lib/python3.8/site-packages/ase/utils/__init__.py", line 486, in iofunc
obj = func(fd, *args, **kwargs)
File "/data/home/reddy/anaconda3/lib/python3.8/site-packages/ase/io/vasp.py", line 270, in read_vasp_out
return list(g)
File "/data/home/reddy/anaconda3/lib/python3.8/site-packages/ase/io/utils.py", line 246, in __call__
yield chunk.build(**kwargs)
File "/data/home/reddy/anaconda3/lib/python3.8/site-packages/ase/io/vasp_parsers/vasp_outcar_parsers.py", line 710, in build
return self.parser.build(self.lines)
File "/data/home/reddy/anaconda3/lib/python3.8/site-packages/ase/io/vasp_parsers/vasp_outcar_parsers.py", line 593, in build
results = self.parse(lines)
File "/data/home/reddy/anaconda3/lib/python3.8/site-packages/ase/io/vasp_parsers/vasp_outcar_parsers.py", line 527, in parse
prop = parser.parse(cursor, lines)
File "/data/home/reddy/anaconda3/lib/python3.8/site-packages/ase/io/vasp_parsers/vasp_outcar_parsers.py", line 436, in parse
assert 'spin component' in lines[cursor]
AssertionError
The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/data/home/reddy/anaconda3/bin/nequip-train", line 8, in <module>
sys.exit(main())
File "/data/home/reddy/anaconda3/lib/python3.8/site-packages/nequip/scripts/train.py", line 65, in main
trainer = fresh_start(config)
File "/data/home/reddy/anaconda3/lib/python3.8/site-packages/nequip/scripts/train.py", line 163, in fresh_start
dataset = dataset_from_config(config, prefix="dataset")
File "/data/home/reddy/anaconda3/lib/python3.8/site-packages/nequip/data/_build.py", line 78, in dataset_from_config
instance, _ = instantiate(
File "/data/home/reddy/anaconda3/lib/python3.8/site-packages/nequip/utils/auto_init.py", line 234, in instantiate
raise RuntimeError(
RuntimeError: Failed to build object with prefix `dataset` using builder `ASEDataset`

Describe the solution you'd like
Can we combine the OUTCAR files of different systems and train with nequip?

I have seen in the following thread it is possible with extxyz format.
#89

Should I convert the OUTCAR files to extxyz format? If so, how do I do it?

Thank you.

🌟 [FEATURE] OpenMM

Is your feature request related to a problem? Please describe.
Using LAMMPS can be a bit of a pain.

Describe the solution you'd like
OpenMM is a really easy-to-use, Pythonic MD code that also allows a user-defined set of forces and runs ridiculously fast on one GPU. It seems they are also starting to incorporate some ML methods. It would be amazing if one could run NequIP MD with OpenMM.

Describe alternatives you've considered
ASE is an option, but wouldn't get the speed of a GPU-accelerated MD code. OpenMM will JIT user-defined forces to the GPU.

❓ [QUESTION] Sweeping hyperparameters with Weights and Biases

Hi all,

I'm relatively new to the code here, is there a way to use Weights and Biases to sweep hyperparameters (like the batch size, etc.)? I've been using the following code:

sweep_id = wandb.sweep(sweep_config, project="sweep")
trainer = TrainerWandB(model=model,**dict(minimal_config))
trainer.save()
trainer.set_dataset(dataset)
wandb.agent(sweep_id, trainer.train(), count=5)

and my sweep_config looks like this:

{'method': 'random',
 'metric': {'goal': 'minimize', 'name': 'validation_e'},
 'parameters': {'batch_size': {'distribution': 'q_log_uniform_values',
                               'max': 256,
                               'min': 32,
                               'q': 8},
                'learning_rate': {'distribution': 'uniform',
                                  'max': 0.1,
                                  'min': 0}}}

The first iteration runs fine, but the subsequent runs fail
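
One thing worth checking in the snippet above, assuming `wandb.agent` is used as described in the W&B documentation: its second argument is a function that the agent calls once per run. Writing `trainer.train()` runs the training loop immediately and passes its return value instead. The stand-in below has no W&B dependency; `fake_agent` and `train_once` are hypothetical names that only illustrate the difference between passing a callable and passing the result of a call.

```python
calls = []


def train_once():
    """Hypothetical stand-in for building a fresh trainer and training it."""
    calls.append("run")


def fake_agent(function, count):
    """Hypothetical stand-in for a sweep agent: invokes `function` once per run."""
    for _ in range(count):
        function()


# Passing the function itself lets the agent start a new run each time.
fake_agent(train_once, count=3)
print(len(calls))

# Passing train_once() runs training once, up front, and hands the agent
# its return value (None here), which is not callable.
try:
    fake_agent(train_once(), count=3)
except TypeError as err:
    print("not callable:", err)
```

In a real sweep, the function passed to the agent would typically create a new trainer from the sweep's parameters on each call; how to map those onto a nequip config is left open here.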

❓Installation instructions for pytorch-geometric

In the Installation section of https://github.com/mir-group/nequip/blob/main/README.md,
there is a link to the installation instructions for pytorch-geometric (https://pytorch-geometric.readthedocs.io/en/latest/notes/installation.html), which points to the instructions for the current version and not for version 1.7.2, which nequip requires.

I suggest changing the link to the instructions for version 1.7.2 at https://pytorch-geometric.readthedocs.io/en/1.7.2/notes/installation.html

🐛 [BUG] e3nn 0.3.5 may not be compatible with nequip 0.5.5

Describe the bug
I just upgraded from nequip 0.5.3 to 0.5.5 and found that I was getting an error in torch.einsum in e3nn.o3.Linear.

Here is the error message

Traceback (most recent call last):
  File "/Users/emil/anaconda3/envs/nequip/lib/python3.9/site-packages/nequip/utils/auto_init.py", line 241, in instantiate
    instance = builder(**positional_args, **final_optional_args)
  File "/Users/emil/anaconda3/envs/nequip/lib/python3.9/site-packages/nequip/nn/_atomwise.py", line 50, in __init__
    self.linear = Linear(
  File "/Users/emil/anaconda3/envs/nequip/lib/python3.9/site-packages/e3nn/o3/_linear.py", line 178, in __init__
    graphmod, self.weight_numel, self.bias_numel = _codegen_linear(
  File "/Users/emil/anaconda3/envs/nequip/lib/python3.9/site-packages/e3nn/o3/_linear.py", line 416, in _codegen_linear
    ein_out = torch.einsum(f"{z}uw,zui->zwi", w, x_list[ins.i_in])
  File "/Users/emil/anaconda3/envs/nequip/lib/python3.9/site-packages/torch/functional.py", line 351, in einsum
    return handle_torch_function(einsum, operands, equation, *operands)
  File "/Users/emil/anaconda3/envs/nequip/lib/python3.9/site-packages/torch/overrides.py", line 1498, in handle_torch_function
    result = torch_func_method(public_api, types, args, kwargs)
  File "/Users/emil/anaconda3/envs/nequip/lib/python3.9/site-packages/torch/fx/proxy.py", line 309, in __torch_function__
    raise RuntimeError(f'Found multiple different tracers {list(tracers.keys())} while '
RuntimeError: Found multiple different tracers [<torch.fx.proxy.GraphAppendingTracer object at 0x7fb4de421a60>, <torch.fx.proxy.GraphAppendingTracer object at 0x7fb4de421850>] while trying to trace operations <function einsum at 0x7fb4da671040>

To Reproduce
Initializing a model should trigger the error:

model = nequip.model.model_from_config

Expected behavior
No error

🐛 [BUG] Different results when using model.train() and model.eval()

Describe the bug
I have a trained nequip model and tried to evaluate it, but I am getting a value quite different from what I expected. After some debugging, I realized that I wasn't setting the model to evaluation mode. If the model is in training mode, I get a different value than if it is in evaluation mode.

To Reproduce

with a trained nequip model,
model.train()(data) != model.eval()(data)

Expected behavior
I would expect the two to give the same result

Environment (please complete the following information):

OS: linux
python version Python 3.9.6
python environment (commands are given for python interpreter):
- nequip version '0.5.3'
- e3nn version '0.3.5'
- pytorch version '1.9.0+cu102'
(if relevant) GPU support with CUDA
- cuda Version according to nvcc (nvcc --version)
- cuda version according to PyTorch '10.2'


🐛 [BUG] Issue with AtomicDataset process() function

Describe the bug
When generating a dataset (in my case from npz), I get the following error message

PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\maxwe\\AppData\\Local\\Temp\\tmpahynxb3w'

With the following traceback

Traceback (most recent call last):
  File "C:\Users\maxwe\SPIN_materials\nequip\utils\savenload.py", line 39, in _process_moves 
    shutil.move(from_name, tmp_path)
  File "C:\Users\maxwe\anaconda3\envs\mat_env\lib\shutil.py", line 834, in move
    os.unlink(src)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\maxwe\\AppData\\Local\\Temp\\tmpahynxb3w'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\maxwe\SPIN_materials\tests.py", line 31, in <module>
    test = NpzDataset(root='./', file_name='./tempout.npz', include_keys=['sid', 'cell', 'pbc', 'r_max', 'pos', 'total_energy'])
  File "C:\Users\maxwe\SPIN_materials\nequip\data\dataset.py", line 701, in __init__
    super().__init__(
  File "C:\Users\maxwe\SPIN_materials\nequip\data\dataset.py", line 166, in __init__
    super().__init__(root=root, type_mapper=type_mapper)
  File "C:\Users\maxwe\SPIN_materials\nequip\data\dataset.py", line 50, in __init__
    super().__init__(root=root, transform=type_mapper)
  File "C:\Users\maxwe\SPIN_materials\nequip\utils\torch_geometric\dataset.py", line 91, in __init__
    self._process()
  File "C:\Users\maxwe\SPIN_materials\nequip\utils\torch_geometric\dataset.py", line 176, in _process
    self.process()
  File "C:\Users\maxwe\SPIN_materials\nequip\data\dataset.py", line 306, in process
    with atomic_write(self.processed_paths[0], binary=True) as f:
  File "C:\Users\maxwe\anaconda3\envs\mat_env\lib\contextlib.py", line 142, in __exit__      
    next(self.gen)
  File "C:\Users\maxwe\SPIN_materials\nequip\utils\savenload.py", line 182, in atomic_write  
    _submit_move(Path(tp.name), Path(fname), blocking=blocking)
  File "C:\Users\maxwe\SPIN_materials\nequip\utils\savenload.py", line 128, in _submit_move  
    _process_moves([obj])
  File "C:\Users\maxwe\SPIN_materials\nequip\utils\savenload.py", line 43, in _process_moves 
    _delete_files_if_exist([m[1] for m in moves])
  File "C:\Users\maxwe\SPIN_materials\nequip\utils\savenload.py", line 25, in _delete_files_if_exist
    f.unlink(missing_ok=True)
  File "C:\Users\maxwe\anaconda3\envs\mat_env\lib\pathlib.py", line 1204, in unlink
    self._accessor.unlink(self)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\maxwe\\AppData\\Local\\Temp\\tmpahynxb3w'

To Reproduce
All I need to reproduce is

from nequip.data.dataset import NpzDataset
test = NpzDataset(root='./', file_name='./temp.npz', include_keys=['sid', 'cell', 'pbc', 'r_max', 'pos', 'total_energy'])

Expected behavior
I expected the dataset to generate (maybe with some key errors, I'm not sure), but at least without an error such as this one.

Environment (please complete the following information):

OS: Windows
python version 3.10.4
python environment (commands are given for python interpreter):
- nequip version 0.5.5
- e3nn version 0.5.0
- pytorch version 1.11.0
  On CPU
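
One plausible reading of the traceback, not confirmed in this report: on Windows a file cannot be moved or deleted while another handle to it is still open, so moving a temporary file before it is closed raises WinError 32. Below is a minimal sketch of an atomic write that closes the temporary file before the move. The helper name atomic_write_bytes is hypothetical and is not nequip's atomic_write.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_bytes(target, payload):
    """Write payload to target so readers never see a partial file.

    Hypothetical helper for illustration only; not part of nequip.
    """
    target = Path(target)
    # Create the temporary file next to the target so that the final
    # rename stays on the same filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=str(target.parent), suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(payload)
        # The handle is closed at this point, before the move. An open
        # handle is what typically triggers "[WinError 32]" on Windows.
        os.replace(tmp_name, str(target))
    except BaseException:
        # Remove the temporary file if writing or moving failed.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        out = Path(workdir) / "processed.bin"
        atomic_write_bytes(out, b"example")
        print(out.read_bytes())
```

The design choice here is to use mkstemp plus an explicit close rather than an open NamedTemporaryFile, so that no handle is held when os.replace runs.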

🐛 [BUG] GPU acceleration on NequIPCalculator

Describe the bug
Something in the PyTorch script fails to compile when deploying the model with the "cuda" device and a call to "get_potential_energy" is made.

To Reproduce

model = NequIPCalculator.from_deployed_model(args.model, device="cuda" if args.gpu else "cpu")
model_energy = atoms.get_potential_energy()

Expected behavior
The calculation runs on the GPU.

Environment (please complete the following information):
Linux
Python 3.9.13
NequIP 0.5.5
e3nn 0.5.0
PyTorch 1.9.1+cu111
Cuda 11.1
Running on A100 GPU

Traceback

  File "/home/tgmaxson/mambaforge/envs/meta_learning_a100/lib/python3.9/site-packages/ase/calculators/calculator.py", line 709, in get_potential_energy
    energy = self.get_property('energy', atoms)
  File "/home/tgmaxson/mambaforge/envs/meta_learning_a100/lib/python3.9/site-packages/ase/calculators/calculator.py", line 737, in get_property
    self.calculate(atoms, [name], system_changes)
  File "/home/tgmaxson/mambaforge/envs/meta_learning_a100/lib/python3.9/site-packages/nequip/ase/nequip_calculator.py", line 118, in calculate
    out = self.model(data)
  File "/home/tgmaxson/mambaforge/envs/meta_learning_a100/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: default_program(18): error: expected a ")"

default_program(21): error: expected a ";"

default_program(23): error: expression must have class type

default_program(24): error: expression must have class type

default_program(25): error: expression must have class type

default_program(26): error: "const_self" has already been declared in the current scope

default_program(26): error: expected a ";"

default_program(27): error: expression must have class type

default_program(27): error: expression must have class type

default_program(28): error: "const_self" has already been declared in the current scope

default_program(28): error: expected a ";"

default_program(29): error: expression must have class type

default_program(29): error: expression must have class type

default_program(29): error: expression must have class type

default_program(31): error: expression must have class type

default_program(31): error: expression must have class type

default_program(31): error: expression must have class type

17 errors detected in the compilation of "default_program".

nvrtc compilation failed:

#define NAN __int_as_float(0x7fffffff)
#define POS_INFINITY __int_as_float(0x7f800000)
#define NEG_INFINITY __int_as_float(0xff800000)


template<typename T>
__device__ T maximum(T a, T b) {
  return isnan(a) ? a : (a > b ? a : b);
}

template<typename T>
__device__ T minimum(T a, T b) {
  return isnan(a) ? a : (a < b ? a : b);
}

extern "C" __global__
void fused_mul_div_sin_div_mul_sub_mul_mul(float* t_, float* t__, float* aten_mul, float* aten_mul_1, float* aten_sub, float* aten_sin, float* aten_div, float* aten_mul_2, float* const_self.model.func.radial_basis.basis._inv_std, float* const_self.model.func.radial_basis.basis._mean, float* const_self.model.func.radial_basis.basis.basis.bessel_weights) {
{
  if (512 * blockIdx.x + threadIdx.x<864 ? 1 : 0) {
    float const_self.model.func.radial_basis.basis.basis.bessel_weights_1 = __ldg(const_self.model.func.radial_basis.basis.basis.bessel_weights + (512 * blockIdx.x + threadIdx.x) % 12);
    float t___1 = __ldg(t__ + (512 * blockIdx.x + threadIdx.x) / 12);
    aten_mul_2[512 * blockIdx.x + threadIdx.x] = const_self.model.func.radial_basis.basis.basis.bessel_weights_1 * t___1;
    aten_div[512 * blockIdx.x + threadIdx.x] = (float)((double)(const_self.model.func.radial_basis.basis.basis.bessel_weights_1 * t___1) / 10.0);
    aten_sin[512 * blockIdx.x + threadIdx.x] = sinf((float)((double)(const_self.model.func.radial_basis.basis.basis.bessel_weights_1 * t___1) / 10.0));
    float const_self.model.func.radial_basis.basis._mean_1 = __ldg(const_self.model.func.radial_basis.basis._mean + (512 * blockIdx.x + threadIdx.x) % 12);
    aten_sub[512 * blockIdx.x + threadIdx.x] = (float)((double)((sinf((float)((double)(const_self.model.func.radial_basis.basis.basis.bessel_weights_1 * t___1) / 10.0))) / t___1) * 0.2) - const_self.model.func.radial_basis.basis._mean_1;
    float const_self.model.func.radial_basis.basis._inv_std_1 = __ldg(const_self.model.func.radial_basis.basis._inv_std + (512 * blockIdx.x + threadIdx.x) % 12);
    aten_mul_1[512 * blockIdx.x + threadIdx.x] = ((float)((double)((sinf((float)((double)(const_self.model.func.radial_basis.basis.basis.bessel_weights_1 * t___1) / 10.0))) / t___1) * 0.2) - const_self.model.func.radial_basis.basis._mean_1) * const_self.model.func.radial_basis.basis._inv_std_1;
    float v = __ldg(t_ + (512 * blockIdx.x + threadIdx.x) / 12);
    aten_mul[512 * blockIdx.x + threadIdx.x] = (((float)((double)((sinf((float)((double)(const_self.model.func.radial_basis.basis.basis.bessel_weights_1 * t___1) / 10.0))) / t___1) * 0.2) - const_self.model.func.radial_basis.basis._mean_1) * const_self.model.func.radial_basis.basis._inv_std_1) * v;
  }
}
}
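
The generated kernel above uses parameter names that contain dots, for example const_self.model.func.radial_basis.basis._mean. Dots are not allowed in C or CUDA identifiers, which would explain syntax errors such as 'expected a ")"'. The sketch below checks a few of the names from the dump. The helper is_valid_c_identifier is hypothetical, written only for illustration; it is not part of PyTorch or nequip and it ignores reserved keywords.

```python
import re

# A C/CUDA identifier: a letter or underscore, then letters, digits, underscores.
_C_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_valid_c_identifier(name):
    """Return True if name is syntactically a C identifier (keywords ignored)."""
    return _C_IDENTIFIER.fullmatch(name) is not None


# Parameter names copied from the kernel signature in the error message.
kernel_params = [
    "t_",
    "aten_mul",
    "const_self.model.func.radial_basis.basis._inv_std",
    "const_self.model.func.radial_basis.basis._mean",
]

invalid = [p for p in kernel_params if not is_valid_c_identifier(p)]

if __name__ == "__main__":
    for name in invalid:
        print("not a valid C identifier:", name)
```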

🐛 [BUG] Max Recursion Depth

Describe the bug
When training a nequip neural network on molecules of 304 atoms, training starts normally, but after a while (around 840 epochs) I get the following error.


RecursionError: maximum recursion depth exceeded while calling a Python object
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/kyukon/scratch/gent/vo/000/gvo00003/vsc43723/my_nequip/my_neq/lib/python3.8/site-packages/torch/fx/graph_module.py", line 505, in wrapped_call
    return cls_call(self, *args, **kwargs)
  File "/kyukon/scratch/gent/vo/000/gvo00003/vsc43723/my_nequip/my_neq/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "<eval_with_key_4055>", line 10, in forward
    new_zeros = x.new_zeros(add);  add = None
SystemError: <method 'new_zeros' of 'torch._C._TensorBase' objects> returned a result with an error set
Call using an FX-traced Module, line 10 of the traced Module's generated forward function:
    add = getitem_1 + (32,);  getitem_1 = None
    new_zeros = x.new_zeros(add);  add = None

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    getattr_3 = x.shape
    getitem_2 = getattr_3[slice(None, -1, None)];  getattr_3 = None
RecursionError: maximum recursion depth exceeded while calling a Python object

To Reproduce
I am still trying to create a minimal working example; at the moment this error only happens after about 10 hours of training.

Environment (please complete the following information):

OS: Ubuntu
python version 3.8.6
python environment (commands are given for python interpreter):
- nequip version 0.3.3
- e3nn version 0.3.3
- pytorch version 1.9.0+cu111
(if relevant) GPU support with CUDA
- cuda Version according to nvcc :v11.1.105
- cuda version according to PyTorch : 11.1

Is this a problem that has occurred before?

❓ [QUESTION] CIF files & other target properties

I have a database of CIF files (i.e., multiple structures), which I have converted into a single database in *.extxyz format. Read in a Python notebook, it appears as a list:
[Atoms(symbols='CH3FI', pbc=False, forces=...),
Atoms(symbols='CH3FI', pbc=False, forces=...),
Atoms(symbols='I2', pbc=False, forces=...),
...
]

I was wondering whether it is possible to use other target properties beyond energies/forces, for instance properties related to the entire structure rather than to the atoms. And what if one does not have energies and forces?

Environment

OS: Ubuntu
python version: 3.9
e3nn version: 0.5.0
pytorch version: 1.11.0+cu102
nequip version: 0.5.5

🌟 [FEATURE] Masking out some labels (e.g. constrained atoms)

BETA implementation of masks: https://github.com/mir-group/nequip/tree/masks/examples/mask_labels

See #240 for more discussion.

🐛 [BUG] Fail to restart and append: 'Trainer' object has no attribute 'iepoch'

Describe the bug
After specifying restart: True and append: True in the .yaml, I get the following error:

Torch device: cuda
Successfully loaded the data set of type NpzDataset(3973)...
Successfully built the network...
! Restarting training ...
Traceback (most recent call last):
  File "trainandtest.py", line 176, in <module>
    trainer.train()
  File "/u/vdavi/.local/lib/python3.7/site-packages/nequip/train/trainer.py", line 673, in train
    while not self.stop_cond:
  File "/u/vdavi/.local/lib/python3.7/site-packages/nequip/train/trainer.py", line 765, in stop_cond
    if self.iepoch >= self.max_epochs:
AttributeError: 'Trainer' object has no attribute 'iepoch'

Looking through your code, it seems that the method from_dict (which is part of the Trainer class) isn't called. This method is where the variable iepoch is set. I'm not sure how to fix this myself. What I'm looking at is /nequip/train/trainer.py, btw.

To Reproduce
I don't think this is too relevant here. All I've done is add restart: True and append: True to the .yaml file shown in the Developer's tutorial. Please let me know if you think I should put it here and I will edit this part.

Expected behavior
I want my training session to restart and append to the previous files. But in any case, an additional question I have is with your terminology. You have "restart" and "requeuing" options. What I want is to resume a training session after it was terminated due to time constraints. I.e. if it was stopped at epoch 20 but I wanted to run 30 epochs I want to resume the training. Is this part of restarting or requeuing?

Environment:

I'll edit this part soon, but for now: I've run it on my personal computer and on the clusters (certainly different environments) and get the same error, so I think this isn't package-dependent but related to trainer.py.

OS: Linux
python version 3.8.10
python environment (commands are given for python interpreter):
- nequip version 0.3.2
- e3nn version 0.3.3
- pytorch version (import torch; torch.__version__)
(if relevant) GPU support with CUDA
- cuda Version according to nvcc (nvcc --version)
- cuda version according to PyTorch (import torch; torch.version.cuda)

Additional context
Please check my comments regarding resuming a training session. Thanks!

Installing nequip issue

Describe the bug
After installing nequip using pip install ., running nequip-train configs/minimal.yaml results in an error: ModuleNotFoundError: No module named 'nequip.scripts'.

To Reproduce
Run pip install . after cd nequip, then run nequip-train configs/minimal.yaml (or any other config file).

Expected behavior
Training is supposed to begin.

Environment (please complete the following information):

OS: Windows (and probably others as well)
Python 3.8
python environment (commands are given for python interpreter):
- nequip develop version
- e3nn version 0.3.2
- pytorch version 1.8.1
(if relevant) GPU support with CUDA
- cuda Version according to nvcc (nvcc --version)
- cuda version according to PyTorch (import torch; torch.version.cuda)

Additional context
Installing nequip with pip install -e . solves the issue.

❓ [QUESTION] About parity in irreducible representation

For O(3) representations, a parity variable is used to represent reflections.
From what I understand, when we apply the action (x,y,z) |-> (-x,-y,-z) to all coordinates, the parity of every feature representation should change.
However, it seems that nequip.nn.embedding.SphericalHarmonicEdgeAttrs always calculates edge attributes with fixed irreducible representations.
Could you explain more about it? Thanks for your time! :)
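
For reference, a spherical harmonic of degree l picks up a factor (-1)^l under inversion, so the irrep label stays fixed (odd for l = 1, even for l = 2) while the feature values either flip sign or stay the same. The following is a minimal numerical sketch with unnormalized real harmonics written out by hand. It is an illustration under that assumption and does not use or reproduce the e3nn implementation.

```python
def harmonics_l1(x, y, z):
    """Unnormalized real spherical harmonics of degree 1."""
    return (x, y, z)


def harmonics_l2(x, y, z):
    """Unnormalized real spherical harmonics of degree 2."""
    return (x * y, y * z, 2 * z * z - x * x - y * y, x * z, x * x - y * y)


def parity(l):
    """Sign picked up by degree-l harmonics under (x, y, z) -> (-x, -y, -z)."""
    return (-1) ** l


def transforms_with_fixed_parity(l, fn, r):
    """Check that fn(-r) equals parity(l) * fn(r) component by component."""
    inverted = tuple(-c for c in r)
    before = fn(*r)
    after = fn(*inverted)
    return all(abs(a - parity(l) * b) < 1e-12 for a, b in zip(after, before))


if __name__ == "__main__":
    r = (0.3, -1.2, 0.5)  # arbitrary example vector
    for l, fn in ((1, harmonics_l1), (2, harmonics_l2)):
        print(l, parity(l), transforms_with_fixed_parity(l, fn, r))
```

In this picture the inversion acts on the values of the features, not on their irrep labels, which is why the edge attributes can be declared with fixed irreps.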

❓ [QUESTION] Merge stress support into master

I've been using the stress branch for quite some time and I haven't experienced any issues with it so far. I was wondering whether it is possible to merge this functionality into the master branch?

❓ [QUESTION] LAMMPS installation problem

I am able to install nequip, LAMMPS and torch separately.
When I try to patch and reinstall LAMMPS, I get the following error.
How can I solve it?

CMake Error in CMakeLists.txt:
Target "lammps" contains relative path in its
INTERFACE_INCLUDE_DIRECTORIES:

"MKL_INCLUDE_DIR-NOTFOUND"

CMake Error in CMakeLists.txt:
Target "lammps" contains relative path in its
INTERFACE_INCLUDE_DIRECTORIES:

"MKL_INCLUDE_DIR-NOTFOUND"

CMake Error in CMakeLists.txt:
Imported target "torch" includes non-existent path

"MKL_INCLUDE_DIR-NOTFOUND"

in its INTERFACE_INCLUDE_DIRECTORIES. Possible reasons include:

The path was deleted, renamed, or moved to another location.
An install or uninstall procedure did not complete successfully.
The installation package was faulty and references files it does not
provide.

CMake Error in CMakeLists.txt:
Imported target "torch" includes non-existent path

"MKL_INCLUDE_DIR-NOTFOUND"

in its INTERFACE_INCLUDE_DIRECTORIES. Possible reasons include:

The path was deleted, renamed, or moved to another location.
An install or uninstall procedure did not complete successfully.
The installation package was faulty and references files it does not
provide.

-- Generating done
CMake Generate step failed. Build files cannot be regenerated correctly.

🐛 [BUG] NequIP running problem on A100 machine

Describe the bug
When I use nequip, installed via pip, on an A100 machine, the following error occurs:

NVIDIA A100-PCIE-40GB with CUDA capability sm_80 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the NVIDIA A100-PCIE-40GB GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

To Reproduce
Run nequip-train config/minimal.yaml on an A100 machine.

Expected behavior
config/minimal.yaml and config/example.yaml run properly.

Environment (please complete the following information):

OS: Fedora 32
python version : 3.10.11
python environment (commands are given for python interpreter):
- nequip version : 0.5.6
- e3nn version 0.5.1
- pytorch version 1.12.1
(if relevant) GPU support with CUDA
- cuda version according to PyTorch 11.7

Additional context
When I ran pip install --upgrade torch to update torch to 2.0.0, the problem seemed to be solved and the example ran properly.
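
The error message compares the GPU's compute capability (sm_80) with the architectures the installed PyTorch build was compiled for. A rough sketch of that comparison follows; the helper names arch_tag and is_directly_supported are hypothetical, and the exact-match test is a simplification that ignores PTX forward compatibility (entries of the form compute_XX). The calls torch.cuda.get_device_capability and torch.cuda.get_arch_list are only used when torch and a GPU are available.

```python
def arch_tag(capability):
    """Turn a (major, minor) compute capability into a tag such as 'sm_80'."""
    major, minor = capability
    return f"sm_{major}{minor}"


def is_directly_supported(capability, arch_list):
    """Return True if the build lists a binary for exactly this capability.

    Simplified, hypothetical check: PTX entries ('compute_XX') are ignored.
    """
    return arch_tag(capability) in arch_list


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None

    if torch is not None and torch.cuda.is_available():
        capability = torch.cuda.get_device_capability(0)
        archs = torch.cuda.get_arch_list()
    else:
        # Fall back to the values quoted in the error message above.
        capability = (8, 0)
        archs = ["sm_37", "sm_50", "sm_60", "sm_70"]

    print(arch_tag(capability), "supported:", is_directly_supported(capability, archs))
```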

🐛 [BUG] Tutorial example runs infinitely

Describe the bug
The tutorial colab example does not finish. The training continues forever, I stopped it after 2000 epochs.

To Reproduce
Run the colab notebook referenced in https://github.com/mir-group/nequip#tutorial

Expected behavior
The training used to stop after 100 epochs; now it runs forever.

🐛 [BUG] Error when trying to use CUDA device with ASE calculator

Describe the bug
NequIP calculator fails with device='cuda' but works with device='cpu'

To Reproduce

from ase import Atoms
from nequip.ase.nequip_calculator import NequIPCalculator

at = Atoms(...)
calc = NequIPCalculator.from_deployed_model('model.pth', device='cuda')
at.calc = calc
at.get_forces()

Expected behavior
Calculator works with either CUDA or CPU device

Environment (please complete the following information):

OS: CentOS
python version: 3.9
python environment (commands are given for python interpreter):
- nequip 0.5.4
- e3nn version 0.5.0
- pytorch version 1.9.0+cu102
(if relevant) GPU support with CUDA
- cuda Version according to nvcc 10.2
- cuda version according to PyTorch 10.2

The stack trace is:

/home/mphuthi/.conda/envs/nequip-stress/lib/python3.9/site-packages/nequip/scripts/deploy.py:109: UserWarning: Loaded model had a different value for allow_tf32 than was currently set; changing the GLOBAL setting!
  warnings.warn(
/home/mphuthi/.conda/envs/nequip-stress/lib/python3.9/site-packages/nequip/scripts/deploy.py:120: UserWarning: Loaded model had a different value for _jit_bailout_depth than was currently set; changing the GLOBAL setting!
  warnings.warn(
Traceback (most recent call last):
  File "/home/mphuthi/Li/models/nequip/run1-6_v0_i0-1_low/nequip_test.py", line 12, in <module>
    ph = nq.phonon_bands_and_dos(
  File "/home/mphuthi/dev/calctest/calctest/calctest.py", line 812, in phonon_bands_and_dos
    ph.run()
  File "/home/mphuthi/.conda/envs/nequip-stress/lib/python3.9/site-packages/ase/phonons.py", line 201, in run
    result = self.calculate(atoms_N, disp)
  File "/home/mphuthi/.conda/envs/nequip-stress/lib/python3.9/site-packages/ase/phonons.py", line 320, in calculate
    forces = self(atoms_N)
  File "/home/mphuthi/.conda/envs/nequip-stress/lib/python3.9/site-packages/ase/phonons.py", line 317, in __call__
    return atoms_N.get_forces()
  File "/home/mphuthi/.conda/envs/nequip-stress/lib/python3.9/site-packages/ase/atoms.py", line 788, in get_forces
    forces = self._calc.get_forces(self)
  File "/home/mphuthi/.conda/envs/nequip-stress/lib/python3.9/site-packages/ase/calculators/abc.py", line 23, in get_forces
    return self.get_property('forces', atoms)
  File "/home/mphuthi/.conda/envs/nequip-stress/lib/python3.9/site-packages/ase/calculators/calculator.py", line 737, in get_property
    self.calculate(atoms, [name], system_changes)
  File "/home/mphuthi/.conda/envs/nequip-stress/lib/python3.9/site-packages/nequip/ase/nequip_calculator.py", line 115, in calculate
    out = self.model(data)
  File "/home/mphuthi/.conda/envs/nequip-stress/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: default_program(18): error: expected a ")"

default_program(21): error: expected a ";"

default_program(23): error: expression must have class type

default_program(24): error: expression must have class type

default_program(25): error: expression must have class type

default_program(26): error: expression must have class type

default_program(28): error: expression must have class type

7 errors detected in the compilation of "default_program".

nvrtc compilation failed:

#define NAN __int_as_float(0x7fffffff)
#define POS_INFINITY __int_as_float(0x7f800000)
#define NEG_INFINITY __int_as_float(0xff800000)


template<typename T>
__device__ T maximum(T a, T b) {
  return isnan(a) ? a : (a > b ? a : b);
}

template<typename T>
__device__ T minimum(T a, T b) {
  return isnan(a) ? a : (a < b ? a : b);
}

extern "C" __global__
void fused_mul_div_sin_div_mul_mul(float* t_, float* t__, float* aten_mul, float* aten_mul_1, float* aten_sin, float* aten_div, float* aten_mul_2, float* const_self.model.energy_model.radial_basis.basis.bessel_weights) {
{
  if (512 * blockIdx.x + threadIdx.x<464000 ? 1 : 0) {
    float const_self.model.energy_model.radial_basis.basis.bessel_weights_1 = __ldg(const_self.model.energy_model.radial_basis.basis.bessel_weights + (512 * blockIdx.x + threadIdx.x) % 8);
    float t___1 = __ldg(t__ + (512 * blockIdx.x + threadIdx.x) / 8);
    aten_mul_2[512 * blockIdx.x + threadIdx.x] = const_self.model.energy_model.radial_basis.basis.bessel_weights_1 * t___1;
    aten_div[512 * blockIdx.x + threadIdx.x] = (float)((double)(const_self.model.energy_model.radial_basis.basis.bessel_weights_1 * t___1) / 6.0);
    aten_sin[512 * blockIdx.x + threadIdx.x] = sinf((float)((double)(const_self.model.energy_model.radial_basis.basis.bessel_weights_1 * t___1) / 6.0));
    aten_mul_1[512 * blockIdx.x + threadIdx.x] = (float)((double)((sinf((float)((double)(const_self.model.energy_model.radial_basis.basis.bessel_weights_1 * t___1) / 6.0))) / t___1) * 0.3333333333333333);
    float v = __ldg(t_ + (512 * blockIdx.x + threadIdx.x) / 8);
    aten_mul[512 * blockIdx.x + threadIdx.x] = (float)((double)((sinf((float)((double)(const_self.model.energy_model.radial_basis.basis.bessel_weights_1 * t___1) / 6.0))) / t___1) * 0.3333333333333333) * v;
  }
}
}

❓ [QUESTION] Newton pair when running lammps

Hello,

I am not sure if it's the right place to ask this question. But I am getting the following error when running lammps with Nequip trained ML potential:
ERROR: Pair style NEQUIP requires newton pair off (src/pair_nequip.cpp:108)
What do you think could be the problem? I am using lammps-29Sep21 and nequip 0.5.5.

Thanks!

Huge bumps in learning curves

Dear developers

I have a problem while training the MLP: I get large bumps in the MAE of the forces. I know that it is not unusual to get some bumps during training, but usually the error returns quite quickly to its value before the bump. That does not happen here: it takes a few hours, and in certain cases a new bump occurs before the old optimal error is reached. See for example the figure below.

I know that I could change for example learning rate, or batch size or even restart the training from the optimal value before the bump. But I was wondering if this is something you saw while training MLPs or if you know what might cause this?

I ask because I am using almost the default settings of the "full.yaml" you provided (I only changed r_max to 6.0), so I would think that the settings are already quite good. However, I got this strange behavior for two different systems (CsPbI3 and FAPbI3) and two different training set sizes (300 and 15000 structures). In the zip file you can find the full.yaml file and the training logs belonging to the figure above.

in_and_output_data.zip

Kind regards

Tom

🐛 [BUG] Cannot use training loss as metrics key

Describe the bug
The example config file includes this line:

metrics_key: validation_loss                                                       # metrics used for scheduling and saving best model. Options: `set`_`quantity`, set can be either "train" or "validation, "quantity" can be loss or anything that appears in the validation batch step header, such as f_mae, f_rmse, e_mae, e_rmse

Following those instructions, I set the value to train_loss. When I do, it fails with the exception

RuntimeError: metrics_key should start with either validation or training

Apparently it actually wants the value to be training_loss instead of train_loss. But when I change it to that, it fails with a different exception:

KeyError: 'training_loss'

It seems that some parts of the code expect one and other parts expect the other, so that neither works.
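
The config comment describes the key as `set`_`quantity`. The sketch below shows a parser that accepts both spellings of the training prefix and maps them to one canonical name. The function split_metrics_key and its table of accepted prefixes are hypothetical; they only illustrate the mismatch described in this report and do not reproduce nequip's code.

```python
# Hypothetical mapping from accepted prefixes to one canonical set name.
_CANONICAL_SET = {
    "validation": "validation",
    "training": "training",
    "train": "training",
}


def split_metrics_key(key):
    """Split 'set_quantity' into (canonical_set, quantity).

    Only the first underscore separates the set from the quantity, so
    quantities such as 'f_mae' keep their own underscore.
    """
    prefix, sep, quantity = key.partition("_")
    if not sep or not quantity or prefix not in _CANONICAL_SET:
        raise ValueError(
            f"metrics_key {key!r} should look like validation_<quantity> "
            "or training_<quantity>"
        )
    return _CANONICAL_SET[prefix], quantity


if __name__ == "__main__":
    for key in ("validation_loss", "train_loss", "training_f_mae"):
        print(key, "->", split_metrics_key(key))
```

Normalizing the prefix in one place is one way to keep the check and the later dictionary lookup from disagreeing about the spelling.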

Environment (please complete the following information):

OS: Ubuntu 18.04
python version 3.9
python environment (commands are given for python interpreter):
- nequip version 0.5.4
- e3nn version 0.4.4
- pytorch version 1.10.0
(if relevant) GPU support with CUDA
- cuda Version according to nvcc (nvcc --version)
- cuda version according to PyTorch (import torch; torch.version.cuda)

🐛 [BUG] TorchScript error

Describe the bug
I am trying to reproduce the MRS2021 tutorial, but with the ASE interface. It worked in the Colab notebook, but I get a TorchScript error when I install and run it on our cluster here at DTU Physics. Nequip is installed into a venv from the developer branch with pip install -e nequip/. PyTorch is installed on the cluster using EasyBuild; it is version 1.9.0, built with the foss/2020b toolchain. The pip-installed version does not support our GPUs.

To Reproduce
The model was trained with the following script


rm -rf ./results
nequip-train ../nequip/configs/example.yaml
nequip-deploy build results/toluene/example-run-toluene toluene-deployed.pth
nequip-evaluate --train-dir results/toluene/example-run-toluene --batch-size 50

The attached script was then run; it produced the attached error message.

Expected behavior
No crash :-)

Environment (please complete the following information):

OS: CentOS 7.9, but all software built with EasyBuild (foss/2020b toolchain)
python version 3.8.6
python environment (commands are given for python interpreter):
- nequip version 0.5.0
- e3nn version 0.4.3
- pytorch version 1.9.0
(if relevant) GPU support with CUDA
- cuda Version according to nvcc Cuda compilation tools, release 11.1, V11.1.105 Build cuda_11.1.TC455_06.29190527_0
- cuda version according to PyTorch 11.1

Additional context
slurm-4259922.log
runoptimize.py.txt

default_config = dict(
    root="./",
    run_name="NequIP",
    wandb=False,
    wandb_project="NequIP",
    model_builders=[
        "SimpleIrrepsConfig",
        "EnergyModel",
        "PerSpeciesRescale",
        "ForceOutput",
        "RescaleEnergyEtc",
    ],
    dataset_statistics_stride=1,
    default_dtype="float32",
    allow_tf32=False,  # TODO: until we understand equivar issues
    verbose="INFO",
    model_debug_mode=False,
    equivariance_test=False,
    grad_anomaly_mode=False,
    append=False,
    _jit_bailout_depth=2,  # avoid 20 iters of pain, see https://github.com/pytorch/pytorch/issues/52286
)
