mir-group / nequip
NequIP is a code for building E(3)-equivariant interatomic potentials
Home Page: https://www.nature.com/articles/s41467-022-29939-5
License: MIT License
Describe the bug
In the Colab tutorial example, running LAMMPS fails because ''../lammps/build/lmp'' cannot be found:
!cd lammps_run/ && ../lammps/build/lmp -in toluene_minimize.in
>>>/bin/bash: ../lammps/build/lmp: No such file or directory
To Reproduce
Run the colab notebook with max_epochs 200
Expected behavior
LAMMPS runs
Describe the bug
NequIP does not work with an RTX 4080 GPU.
Expected behavior
Package works with latest NVIDIA GPUs, including RTX 4080
Additional context
The PyTorch and CUDA versions supported by the package are not compatible with the latest NVIDIA GPUs. Is it possible (or planned) to make NequIP compatible with the latest PyTorch and CUDA?
Hi,
I'm experiencing some weird behaviour with the dataset preprocessing step. Depending on specific combinations of lattice vectors and pbcs, this preprocessing can take an unexpected amount of time. To frame the following, I'm extracting clusters of atoms from periodic UiO-66.
Three different (very small) datasets in the added zip-file nicely illustrate this:
A very rough timing job revealed that the first and third dataset preprocess in a comparable timespan (about half a second on my machine), whereas the second dataset took almost 2 orders of magnitude longer.
Any ideas as to where this discrepancy originates? It can easily be avoided by a small change to the dataset, but it seems like it should not occur in the first place.
FYI, I recently updated to v0.5.0, but initially encountered this issue in v0.3.3.
When I try to run an MD NVT simulation with a deployed nequip model using ASE, I get the following error:
/kyukon/scratch/gent/vo/000/gvo00003/vsc42365/ForInstall/YAFF_ASE_CP2K_NEQUIP/accelgor/yacn_acc/lib/python3.8/site-packages/nequip/scripts/deploy.py:115: UserWarning: Loaded model had a different value for _jit_bailout_depth than was currently set; changing the GLOBAL setting!
warnings.warn(
Traceback (most recent call last):
File "yaff_neq_MD.py", line 193, in <module>
simulate(steps, step, start, atoms, calc_neq, temperature, pressure)
File "yaff_neq_MD.py", line 144, in simulate
verlet.run(steps)
File "/kyukon/scratch/gent/vo/000/gvo00003/vsc42365/ForInstall/YAFF_ASE_CP2K_NEQUIP/accelgor/yacn_acc/lib/python3.8/site-packages/yaff/sampling/iterative.py", line 128, in run
if self.propagate():
File "/kyukon/scratch/gent/vo/000/gvo00003/vsc42365/ForInstall/YAFF_ASE_CP2K_NEQUIP/accelgor/yacn_acc/lib/python3.8/site-packages/yaff/sampling/verlet.py", line 351, in propagate
self.epot = self.ff.compute(self.gpos, self.vtens)
File "/kyukon/scratch/gent/vo/000/gvo00003/vsc42365/ForInstall/YAFF_ASE_CP2K_NEQUIP/accelgor/yacn_acc/lib/python3.8/site-packages/yaff/pes/ff.py", line 157, in compute
self.energy = self._internal_compute(my_gpos, my_vtens)
File "/kyukon/scratch/gent/vo/000/gvo00003/vsc42365/ForInstall/YAFF_ASE_CP2K_NEQUIP/accelgor/yacn_acc/lib/python3.8/site-packages/yaff/pes/ff.py", line 272, in _internal_compute
result = sum([part.compute(gpos, vtens) for part in self.parts])
File "/kyukon/scratch/gent/vo/000/gvo00003/vsc42365/ForInstall/YAFF_ASE_CP2K_NEQUIP/accelgor/yacn_acc/lib/python3.8/site-packages/yaff/pes/ff.py", line 272, in <listcomp>
result = sum([part.compute(gpos, vtens) for part in self.parts])
File "/kyukon/scratch/gent/vo/000/gvo00003/vsc42365/ForInstall/YAFF_ASE_CP2K_NEQUIP/accelgor/yacn_acc/lib/python3.8/site-packages/yaff/pes/ff.py", line 157, in compute
self.energy = self._internal_compute(my_gpos, my_vtens)
File "yaff_neq_MD.py", line 68, in _internal_compute
energy = self.atoms.get_potential_energy() * molmod.units.electronvolt
File "/kyukon/scratch/gent/vo/000/gvo00003/vsc42365/ForInstall/YAFF_ASE_CP2K_NEQUIP/accelgor/yacn_acc/lib/python3.8/site-packages/ase/atoms.py", line 731, in get_potential_energy
energy = self._calc.get_potential_energy(self)
File "/kyukon/scratch/gent/vo/000/gvo00003/vsc42365/ForInstall/YAFF_ASE_CP2K_NEQUIP/accelgor/yacn_acc/lib/python3.8/site-packages/ase/calculators/abc.py", line 24, in get_potential_energy
return self.get_property(name, atoms)
File "/kyukon/scratch/gent/vo/000/gvo00003/vsc42365/ForInstall/YAFF_ASE_CP2K_NEQUIP/accelgor/yacn_acc/lib/python3.8/site-packages/ase/calculators/calculator.py", line 499, in get_property
self.calculate(atoms, [name], system_changes)
File "/kyukon/scratch/gent/vo/000/gvo00003/vsc42365/ForInstall/YAFF_ASE_CP2K_NEQUIP/accelgor/yacn_acc/lib/python3.8/site-packages/nequip/ase/nequip_calculator.py", line 111, in calculate
out = self.model(AtomicData.to_AtomicDataDict(data))
File "/kyukon/scratch/gent/vo/000/gvo00003/vsc42365/ForInstall/YAFF_ASE_CP2K_NEQUIP/accelgor/yacn_acc/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: default_program(18): error: expected a ")"
default_program(21): error: expected a ";"
default_program(23): error: expression must have class type
default_program(24): error: expression must have class type
default_program(25): error: expression must have class type
default_program(26): error: expression must have class type
default_program(28): error: expression must have class type
7 errors detected in the compilation of "default_program".
nvrtc compilation failed:
#define NAN __int_as_float(0x7fffffff)
#define POS_INFINITY __int_as_float(0x7f800000)
#define NEG_INFINITY __int_as_float(0xff800000)
template<typename T>
__device__ T maximum(T a, T b) {
return isnan(a) ? a : (a > b ? a : b);
}
template<typename T>
__device__ T minimum(T a, T b) {
return isnan(a) ? a : (a < b ? a : b);
}
extern "C" __global__
void fused_mul_div_sin_div_mul_mul(float* t_, float* t__, float* aten_mul, float* aten_mul_1, float* aten_sin, float* aten_div, float* aten_mul_2, float* const_self.model.func.radial_basis.basis.bessel_weights) {
{
if (512 * blockIdx.x + threadIdx.x<241728 ? 1 : 0) {
float const_self.model.func.radial_basis.basis.bessel_weights_1 = __ldg(const_self.model.func.radial_basis.basis.bessel_weights + (512 * blockIdx.x + threadIdx.x) % 8);
float t___1 = __ldg(t__ + (512 * blockIdx.x + threadIdx.x) / 8);
aten_mul_2[512 * blockIdx.x + threadIdx.x] = const_self.model.func.radial_basis.basis.bessel_weights_1 * t___1;
aten_div[512 * blockIdx.x + threadIdx.x] = (float)((double)(const_self.model.func.radial_basis.basis.bessel_weights_1 * t___1) / 6.0);
aten_sin[512 * blockIdx.x + threadIdx.x] = sinf((float)((double)(const_self.model.func.radial_basis.basis.bessel_weights_1 * t___1) / 6.0));
aten_mul_1[512 * blockIdx.x + threadIdx.x] = (float)((double)((sinf((float)((double)(const_self.model.func.radial_basis.basis.bessel_weights_1 * t___1) / 6.0))) / t___1) * 0.3333333333333333);
float v = __ldg(t_ + (512 * blockIdx.x + threadIdx.x) / 8);
aten_mul[512 * blockIdx.x + threadIdx.x] = (float)((double)((sinf((float)((double)(const_self.model.func.radial_basis.basis.bessel_weights_1 * t___1) / 6.0))) / t___1) * 0.3333333333333333) * v;
}
}
}
I use the following code to create the nequip calculator:
from nequip.ase.nequip_calculator import NequIPCalculator

# path_model is the path to the deployed model file, defined elsewhere in the script
calc_neq = NequIPCalculator.from_deployed_model(
    model_path=path_model,
    species_to_type_name={"C": "C", "H": "H", "N": "N", "Pb": "Pb", "I": "I"},
    device="cuda",
)
I have installed nequip and other libraries with the following commands (I was using remote computing infrastructure for which Python 3.8.6 and CUDA 11.1.1 were already installed):
pip install numpy==1.19.5
pip install git+https://gitlab.com/ase/ase.git
pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html
pip install torch-scatter torch-sparse torch-cluster torch-spline-conv torch-geometric==1.7.2 -f https://data.pyg.org/whl/torch-1.9.0+cu111.html
pip install "git+https://github.com/Linux-cpp-lisp/pytorch_ema@context_manager#egg=torch_ema"
pip install git+https://github.com/mir-group/nequip
pip install wandb
With this installation, I was able to successfully train a nequip model, the info of the trained model can be found in the following file:
deployed_model_info.txt
As I have no experience with Torch, I do not understand the underlying problem. For example, is this an installation problem, is there something wrong with the nequip model, or am I defining the nequip calculator incorrectly? Could you propose possible solutions or give any insight into this issue?
Kind regards
Tom
Hi,
we are trying to train a model using a dataset in extxyz format, which contains cells of bcc iron.
Here is a sample of a cell:
54
Lattice="8.5008 0.0 0.0 0.0 8.5008 0.0 0.0 0.0 8.5008" Properties=species:S:1:pos:R:3:forces:R:3:Z:I:1 config_type=phonons_54_high config_name=bcc_bulk_54_0000 ecutwfc=1224.51225528 pbc="T T T" kpoints="4 4 4" degauss=0.136056917253 energy=-186887.234986
Fe -0.00607641 0.00230075 -0.02907879 -0.73492052 -0.02561024 0.61799986 26
Fe 1.32012806 1.31778495 1.44496509 0.44297176 1.18663711 -0.28337233 26
Fe -0.01845451 -0.12003023 2.89408200 0.78139089 0.95602536 -0.81690734 26
...
(each row contains position and forces per atom; only the first 3 out of 54 atoms are shown)
Since cells of different sizes are present (mainly 1-, 54- and 128-atom cells), the total energies are very different.
At the end of training we obtain a validation MAE on e/N of hundreds of meV, and poor performance when using the model to predict atomic properties.
The problem vanishes if we limit training to cells of the same size. Is it related to the different sizes of the cells? Is it perhaps necessary to standardize the energies beforehand?
Attached is our configuration file: test.txt. It is a variant of minimal_eng.yaml as we are training only on energies for now.
Which is the correct setup of the configuration file to use in this case?
test.txt
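The question about cells of different sizes can be illustrated with a small sketch. It is not NequIP code; it only shows, under stated assumptions, why total energies of 1-, 54- and 128-atom cells are hard to compare directly while energies relative to a fitted per-atom reference are not. The 54-atom energy is the one from the sample above; the other two totals and the name `E_REF_GUESS` are hypothetical values chosen for illustration.

```python
# Sketch: compare raw total energies with residuals after removing a
# per-atom reference energy (single species, so E ~ e0 * N).

E_REF_GUESS = -3460.87  # hypothetical per-atom energy in eV, close to the sample

atom_counts = [1, 54, 128]
total_energies = [
    E_REF_GUESS * 1,            # hypothetical 1-atom cell
    -186887.234986,             # 54-atom cell from the sample above
    E_REF_GUESS * 128 - 0.64,   # hypothetical 128-atom cell
]


def fit_reference_energy(energies, counts):
    """Least-squares fit of e0 in E = e0 * N for a single species."""
    numerator = sum(e * n for e, n in zip(energies, counts))
    denominator = sum(n * n for n in counts)
    return numerator / denominator


e0 = fit_reference_energy(total_energies, atom_counts)

# Per-atom residuals are what a model would have to learn after the shift.
residuals_per_atom = [
    (e - e0 * n) / n for e, n in zip(total_energies, atom_counts)
]

spread_total = max(total_energies) - min(total_energies)
print(f"spread of total energies: {spread_total:.1f} eV")
print(f"fitted per-atom reference: {e0:.5f} eV")
print("per-atom residuals (eV):", [round(r, 5) for r in residuals_per_atom])
```

With these numbers the total energies span several hundred thousand eV, while the per-atom residuals are a few meV. Whether a given configuration file applies such a per-atom shift automatically is not established by this sketch.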
Is your feature request related to a problem? Please describe.
The trained model shows great performance on datasets with a single system, like MD17. I'm wondering if it also supports training and fitting on datasets with multiple systems.
I removed npz_fixed_field_keys from the config file and ran a quick test on the sn2-reaction dataset used in PhysNet, which includes various structures of different molecules related to those reactions. It uses zero padding to represent molecules with fewer atoms. Here is the result.
Training
# Epoch batch loss loss_f loss_e 0_f_rmse 1_f_rmse 2_f_rmse 3_f_rmse 4_f_rmse 5_f_rmse 6_f_rmse all_f_rmse 0_f_mae 1_f_mae 2_f_mae 3_f_mae 4_f_mae 5_f_mae 6_f_mae all_f_mae e_mae
1 1 nan nan nan 0 0 0 1.42 0 0 0 0.203 0 0 0 0.711 0 0 0 0.102 nan
1 2 nan nan nan 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 nan
1 3 nan nan nan 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 nan
1 4 nan nan nan 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 nan
1 5 nan nan nan 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 nan
1 6 nan nan nan 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 nan
1 7 nan nan nan 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 nan
1 8 nan nan nan 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 nan
Also, in order to model long-range interactions, properties like atomic charges and dipoles are needed. I wonder whether it's possible to build on the existing code to calculate them.
I've tried to implement it myself, but it is not clear to me exactly which part of the code should be modified. :(
Describe the solution you'd like
Additional context
atomic_number in the sn2 dataset:
>>> print(x['Z'][:10])
[[ 6 1 1 1 9 53]
[53 53 0 0 0 0]
[ 6 1 1 1 35 9]
[ 6 1 1 1 9 53]
[ 9 9 0 0 0 0]
[ 6 1 1 1 17 53]
[ 6 1 1 1 17 17]
[ 6 1 1 1 35 17]
[ 6 1 1 1 17 17]
[ 6 1 1 1 9 53]]
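A minimal sketch of how the zero padding shown above could be stripped before training, assuming that an atomic number of 0 always marks a padded slot. The function name `strip_padding` and the per-structure dictionary layout are hypothetical; this is not how NequIP itself reads npz files.

```python
# Sketch: turn zero-padded atomic-number rows into variable-length structures.

Z_padded = [
    [6, 1, 1, 1, 9, 53],
    [53, 53, 0, 0, 0, 0],
    [6, 1, 1, 1, 35, 9],
    [9, 9, 0, 0, 0, 0],
]


def strip_padding(z_row, *per_atom_rows):
    """Drop entries whose atomic number is 0 (assumed to be padding).

    Any additional per-atom rows (positions, forces, ...) are filtered with
    the same mask so that they stay aligned with the atomic numbers.
    """
    keep = [i for i, z in enumerate(z_row) if z != 0]
    stripped_z = [z_row[i] for i in keep]
    stripped_rows = [[row[i] for i in keep] for row in per_atom_rows]
    return stripped_z, stripped_rows


structures = []
for z_row in Z_padded:
    z, _ = strip_padding(z_row)
    structures.append({"atomic_numbers": z, "n_atoms": len(z)})

for s in structures:
    print(s["n_atoms"], s["atomic_numbers"])
```

Each structure then carries its own atom count, so padded slots never reach the loss or the metrics.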
Is there any rule of thumb for nequip memory requirements? I have a 7000 configurations dataset, (Si 64 atoms each, periodic structure), in npz format. I am trying to train on 3000 configs, but I keep getting my jobs killed by scheduler for OOM error.
forces: 1 from the loss_coeffs section but it still says that it calculated forces rms for scaling, is this expected?) I am using the model in example.yaml but with 3 layers instead of 4.
My last attempt was 1 core, 1 A100 GPU, 100 GB ram.
Dear nequip developers,
We recently started training a model on a copper formate system. It is similar to the system you described in the nequip paper (48 copper atoms and 1 formate molecule), but slightly bigger: 144 copper atoms and 1 formate molecule.
We generated training data with an approach similar to the one you described: we performed nudged elastic band simulations with 20 images, from which we chose 14 images and ran AIMD from each for 500 steps of 0.5 fs at 300 K. We then trained a model on this data, which obtained a low prediction error on both energies and forces on the test set. However, when performing MD simulations (with LAMMPS), the model quickly broke down: we saw the formate molecule entering the copper surface and the copper surface disintegrating.
We figured we needed more frames at higher temperatures to inform the model better, so we performed additional MD simulations at 500, 2000 and 4000 K. Apart from the temperature, they were the same as the previous AIMD calculation. In total, this gave a dataset of 28000 structures.
These 28k structures were split into 80% training and 20% testing sets. The training set was then further divided into 80% for training and 20% for validation. We used this many structures because smaller training sets did not yield good MD trajectories.
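The dataset sizes described above can be checked with a few lines of arithmetic. This is only a bookkeeping sketch; the variable names are hypothetical and the numbers are those quoted in the report.

```python
# Sketch: reproduce the dataset bookkeeping from the description above.

n_images = 14                             # NEB images used as AIMD starting points
steps_per_run = 500                       # AIMD steps per image and temperature
temperatures_K = [300, 500, 2000, 4000]

n_total = n_images * steps_per_run * len(temperatures_K)

n_test = n_total * 20 // 100              # 20% held out for testing
n_train_val = n_total - n_test            # remaining 80%
n_val = n_train_val * 20 // 100           # 20% of that for validation
n_train = n_train_val - n_val

print(n_total, n_test, n_train_val, n_train, n_val)
```

An exact 80/20 split of the 22400 non-test structures would give 17920 and 4480. The configuration quoted in this report uses n_train: 18000 and n_val: 4400, which sum to the same 22400.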
We used the following configuration of the network (example of yaml file):
root: results/name
run_name: name
seed: 123
dataset_seed: 456
append: true
default_dtype: float32
allow_tf32: false
# network
r_max: 5.0
num_layers: 5
chemical_embedding_irreps_out: 32x0e
feature_irreps_hidden: 32x0o + 32x0e + 32x1o + 32x1e + 32x2o + 32x2e
irreps_edge_sh: 0e + 1o
conv_to_output_hidden_irreps_out: 16x0e
nonlinearity_type: gate
resnet: false
nonlinearity_scalars:
e: silu
o: tanh
nonlinearity_gates:
e: silu
o: tanh
# radial network basis
num_basis: 8
BesselBasis_trainable: true
PolynomialCutoff_p: 6
# radial network
invariant_layers: 2
invariant_neurons: 64
avg_num_neighbors: null
use_sc: true
compile_model: false
dataset: npz
dataset_file_name: directory_to_dataset.npz
key_mapping:
z: atomic_numbers
E: total_energy
F: forces
R: pos
npz_fixed_field_keys:
- atomic_numbers
chemical_symbol_to_type:
H: 0
C: 1
O: 2
Cu: 3
# logging
wandb: true
wandb_project: project_name
wandb_resume: true
verbose: info
log_batch_freq: 1
log_epoch_freq: 1
save_checkpoint_freq: -1
save_ema_checkpoint_freq: -1
# training
n_train: 18000
n_val: 4400
learning_rate: 0.005
batch_size: 5
max_epochs: 100000
train_val_split: random
shuffle: false #because shuffle data beforehand
metrics_key: validation_loss
use_ema: true
ema_decay: 0.99
ema_use_num_updates: true
report_init_validation: false
early_stopping_patiences:
validation_loss: 50
early_stopping_delta:
validation_loss: 0.01
early_stopping_cumulative_delta: false
early_stopping_lower_bounds:
LR: 1.0e-6
early_stopping_upper_bounds:
wall: 1.0e+100
# loss function
loss_coeffs:
forces: 21904 #i.e. N^2 (=148^2)
total_energy:
- 1
- MSELoss
metrics_components:
- - forces
- rmse
- PerSpecies: True
report_per_component: False
- - forces
- mae
- PerSpecies: True
report_per_component: False
- - total_energy
- mae
- PerAtom: True
- - total_energy
- mae
- PerAtom: False
# optimizer
optimizer_name: Adam
optimizer_amsgrad: true
optimizer_betas: !!python/tuple
- 0.9
- 0.999
optimizer_eps: 1.0e-08
optimizer_weight_decay: 0
max_gradient_norm: null
lr_scheduler_name: ReduceLROnPlateau
lr_scheduler_patience: 100
lr_scheduler_factor: 0.5
per_species_rescale_scales_trainable: false
per_species_rescale_shifts_trainable: false
per_species_rescale_shifts: dataset_per_atom_total_energy_mean
global_rescale_shift: null
global_rescale_scale: dataset_forces_rms
global_rescale_shift_trainable: false
global_rescale_scale_trainable: false
Early stopping was triggered after approximately 1350 epochs, with the errors listed below (eV). These are quite reasonable, although e_mae could be better; we attribute that to the loss function (weight ratio F:E = 148^2:1).
0_f_rmse = 0.213254
1_f_rmse = 0.098634
2_f_rmse = 0.097583
3_f_rmse = 0.046718
all_f_rmse = 0.114047
0_f_mae = 0.075106
1_f_mae = 0.043834
2_f_mae = 0.042400
3_f_mae = 0.025226
all_f_mae = 0.046642
e/N_mae = 0.035423
e_mae = 5.242538
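As a quick consistency check of the numbers above, the following sketch relates the force weight and the two energy errors to the system size. It assumes every structure has the same 148 atoms (144 Cu plus a four-atom formate), in which case the per-atom MAE is simply the total-energy MAE divided by the atom count.

```python
# Sketch: relate the reported metrics to the number of atoms per structure.

n_cu = 144
n_formate = 4                      # H + C + 2 O
n_atoms = n_cu + n_formate         # 148

force_weight = n_atoms ** 2        # value used for loss_coeffs forces

e_mae = 5.242538                   # reported total-energy MAE, eV
e_mae_per_atom = e_mae / n_atoms   # should match the reported e/N_mae

print(force_weight)
print(f"{e_mae_per_atom:.6f} eV/atom = {1000 * e_mae_per_atom:.1f} meV/atom")
```

The result agrees with the reported e/N_mae of 0.035423, that is, roughly 35 meV per atom.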
Unfortunately, upon performing MD simulations, the system was again acting in a non-physical way.
Below are some screenshots at start, 50fs and 150fs, respectively:
We used the following MD example configuration in LAMMPS:
units real
newton off
read_data structure.data
pair_style nequip
pair_coeff * * model-deployed.pth C Cu H O
mass 1 12.0107
mass 2 63.546
mass 3 1.00794
mass 4 15.999
# Run MD
timestep 1.0
dump 1 all custom 10 traj_nvt.lammpstrj id type x y z ix iy iz
velocity all create 300.0 4928459
# temp 300K and pressure 1 atm
fix fxnvt all nvt temp 300 300 100.0 tchain 4
fix fxlmom all momentum 10 linear 1 1 1
thermo 10
thermo_style custom step etotal ke temp pe ebond eangle edihed eimp evdwl ecoul elong press vol cella cellb cellc density
run 10000
write_data system_after_nvt.restart
write_data system_after_nvt.data
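One thing that may be worth checking in an input like the one above is the unit system. LAMMPS `units real` uses kcal/mol for energy and fs for time, whereas `units metal` uses eV and ps. The sketch below only converts between the two conventions. It assumes the model was trained on energies in eV (as the errors quoted above suggest) and that the pair style passes model outputs through without unit conversion; neither assumption is confirmed in this report, and the helper names are hypothetical.

```python
# Sketch: conversion factors between LAMMPS "metal" and "real" conventions.

KCAL_PER_MOL_PER_EV = 23.0605      # 1 eV expressed in kcal/mol (rounded)
PS_PER_FS = 1.0e-3


def ev_to_kcal_per_mol(energy_ev):
    """Convert an energy from eV to kcal/mol."""
    return energy_ev * KCAL_PER_MOL_PER_EV


def fs_to_ps(time_fs):
    """Convert a time from femtoseconds to picoseconds."""
    return time_fs * PS_PER_FS


# A 1.0 fs timestep (as in the input above) expressed in metal time units:
print(fs_to_ps(1.0))               # 0.001

# If an energy meant as 1 eV were read as 1 kcal/mol, it would be too small
# by this factor:
print(ev_to_kcal_per_mol(1.0))
```

If the assumptions hold, a value produced in eV but interpreted as kcal/mol would be underestimated by a factor of about 23, and the 1.0 fs timestep is twice the 0.5 fs step used to generate the AIMD data.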
Do you have any insight into what might be going wrong? Have you tried (and succeeded in) performing MD simulations on the system from the "Heterogeneous catalysis of formate dehydrogenation" section of your paper?
Thanks in advance,
Jim Boelrijk and Bart de Mooij
Describe the bug
When running multiple dynamics runs using the nequip calculator for ASE, some trajectories occasionally crash with the error below:
Traceback (most recent call last):
File "/home/nhattrup/Fluxional_MD/scripts/md.py", line 71, in <module>
nvt_dyn.run(steps=args.num_steps)
File "/home/nhattrup/.conda/envs/nequip/lib/python3.9/site-packages/ase/md/md.py", line 137, in run
return Dynamics.run(self)
File "/home/nhattrup/.conda/envs/nequip/lib/python3.9/site-packages/ase/optimize/optimize.py", line 156, in run
for converged in Dynamics.irun(self):
File "/home/nhattrup/.conda/envs/nequip/lib/python3.9/site-packages/ase/optimize/optimize.py", line 135, in irun
self.step()
File "/home/nhattrup/.conda/envs/nequip/lib/python3.9/site-packages/ase/md/langevin.py", line 171, in step
forces = atoms.get_forces(md=True)
File "/home/nhattrup/.conda/envs/nequip/lib/python3.9/site-packages/ase/atoms.py", line 788, in get_forces
forces = self._calc.get_forces(self)
File "/home/nhattrup/.conda/envs/nequip/lib/python3.9/site-packages/ase/calculators/abc.py", line 23, in get_forces
return self.get_property('forces', atoms)
File "/home/nhattrup/.conda/envs/nequip/lib/python3.9/site-packages/ase/calculators/calculator.py", line 737, in get_property
self.calculate(atoms, [name], system_changes)
File "/home/nhattrup/.conda/envs/nequip/lib/python3.9/site-packages/nequip/ase/nequip_calculator.py", line 108, in calculate
data = AtomicData.from_ase(atoms=atoms, r_max=self.r_max)
File "/home/nhattrup/.conda/envs/nequip/lib/python3.9/site-packages/nequip/data/AtomicData.py", line 427, in from_ase
return cls.from_points(
File "/home/nhattrup/.conda/envs/nequip/lib/python3.9/site-packages/nequip/data/AtomicData.py", line 308, in from_points
edge_index, edge_cell_shift, cell = neighbor_list_and_relative_vec(
File "/home/nhattrup/.conda/envs/nequip/lib/python3.9/site-packages/nequip/data/AtomicData.py", line 744, in neighbor_list_and_relative_vec
raise ValueError(
ValueError: After eliminating self edges, no edges remain in this system.
To Reproduce
I am using the most recent nequip version with ASE; if needed, I am happy to supply the deployed nequip model I am using. Below is the code I use to create the Langevin object and run the dynamics:
for i in range(args.samples):
    nvt_dyn = Langevin(
        atoms=atoms,
        temperature_K=args.temperature,
        timestep=args.dt * units.fs,
        friction=0.02)
    traj_file = args.dir + '/' + 'Trajectory_' + str(i) + '.traj'
    print(i, traj_file)
    MaxwellBoltzmannDistribution(atoms=atoms, temp=args.temperature * units.kB)
    ZeroRotation(atoms)  # Set total angular momentum to zero
    Stationary(atoms)  # Set center of mass momentum to zero
    traj = ASETrajectory(traj_file, 'w', atoms)
    traj.write(atoms)
    nvt_dyn.attach(traj.write, interval=args.interval)
    nvt_dyn.run(steps=args.num_steps)
    traj.close()
    # reset atom positions to initial sampling geometry
    atoms.set_positions(init_xyz.copy())
Expected behavior
It should just perform the dynamics with no issues and print the associated trajectory number and the path where data is being written, i.e.:
1 ../nequip/.../ASE/Trajectory_1.traj
2 ../nequip/.../ASE/Trajectory_2.traj
Environment (please complete the following information):
Additional Context
The trajectories that do not fail look perfectly reasonable.
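The ValueError in the traceback above appears to be raised by the neighbor-list construction when no pair of atoms lies within the cutoff. A minimal diagnostic sketch, assuming a non-periodic system and ignoring periodic images, is to count the pairs within r_max for a given frame. The function name and the example coordinates are hypothetical.

```python
# Sketch: count atom pairs closer than the cutoff (no periodic images).
import math
from itertools import combinations


def count_pairs_within_cutoff(positions, r_max):
    """Return the number of unordered atom pairs with distance < r_max."""
    return sum(
        1 for a, b in combinations(positions, 2) if math.dist(a, b) < r_max
    )


# Two atoms that have drifted 10 Angstrom apart: no edges for r_max = 5.
far_apart = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
# The same two atoms at a bonded distance.
bonded = [(0.0, 0.0, 0.0), (1.1, 0.0, 0.0)]

print(count_pairs_within_cutoff(far_apart, r_max=5.0))
print(count_pairs_within_cutoff(bonded, r_max=5.0))
```

A count of zero for some frame would indicate that the atoms moved further apart than the cutoff before the crash, which points at the dynamics rather than at the calculator setup.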
We are interested in training nequip potentials on large datasets of several million structures.
Consequently we wanted to know whether multi-gpu support exists or if someone knows whether the networks can be integrated into pytorch lightning.
best regards and thank you very much,
Jonathan
Ps: this might be related to #126
Is your feature request related to a problem? Please describe.
Is there a plan to support newer versions of PyTorch in the near or far future? Currently installing PyTorch from Conda restricts the python version since there is no PyTorch=1.11 for Python>=3.10 as far as I can tell.
Is your feature request related to a problem? Please describe.
Whilst using nequip myself I was going through the code to better understand the parameter handling.
I'm not using wandb but DVC with ZnTrack, and therefore I'm currently using two nested subprocess calls (the first is DVC, the second is in ZnTrack). I was looking for a way to train the model directly (basically calling train.main).
Whilst looking for that, I was wondering if the Config object (Line 45 in eb6f9bc) could be replaced by a Python dataclass?
I could see multiple benefits here:
nequip/nequip/scripts/train.py, Lines 26 to 47 in eb6f9bc
The config.yaml files are documented and that helped me a lot, but I think having a documented dataclass in addition could also be helpful.
Describe the solution you'd like
I could have a look at it if there is some general interest.
Additional context
One could use the __doc__ of the dataclass to automatically generate the yaml file including the documentation, so that it only has to be maintained in one place. This would be a proof-of-concept example:
import dataclasses
import yaml
import re

@dataclasses.dataclass
class Config:
    """
    Attributes:
        seed: model seed
        dataset_seed: data set seed
    """
    seed: int = 123456
    dataset_seed: int = 31415

    @property
    def to_yaml(self) -> str:
        """Convert the dataclass to a yaml string including documentation"""
        doc_dict = {}
        data_dict = {}
        for field in dataclasses.fields(self):
            data_dict[field.name] = getattr(self, field.name)
            doc_dict[field.name] = re.search(  # not the final version
                rf"(?<={field.name}:).*", self.__doc__
            ).group(0)
        yaml_string = ""
        for line in yaml.dump(data_dict, indent=4).splitlines():
            for key in doc_dict:
                if line.startswith(key):
                    yaml_string += f"{line} # {doc_dict[key]} \n"
                    break
            else:
                yaml_string += f"{line}\n"
        return yaml_string
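A possible variation on the proof of concept above, sketched under the assumption that all options are simple scalars: instead of parsing `__doc__` with a regular expression, the description of each option can be stored in the field metadata. The helper `doc_field` is hypothetical, and the output is written by hand rather than with a YAML library, so this only covers flat key/value pairs.

```python
# Sketch: keep per-option documentation in dataclass field metadata.
import dataclasses


def doc_field(default, doc):
    """Create a dataclass field with a default and an attached description."""
    return dataclasses.field(default=default, metadata={"doc": doc})


@dataclasses.dataclass
class Config:
    seed: int = doc_field(123456, "model seed")
    dataset_seed: int = doc_field(31415, "data set seed")

    def to_yaml(self) -> str:
        """Render flat scalar options as 'key: value  # doc' lines."""
        lines = []
        for field in dataclasses.fields(self):
            value = getattr(self, field.name)
            lines.append(f"{field.name}: {value}  # {field.metadata['doc']}")
        return "\n".join(lines) + "\n"


print(Config().to_yaml(), end="")
print(Config(seed=1).to_yaml(), end="")
```

Keeping the description next to the default means a renamed or removed option cannot leave a stale docstring entry behind, at the cost of a slightly noisier class body.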
Describe the bug
torch.jit.script does not accept a plain dict where typing.Dict is expected.
To Reproduce
The unit test tests/data/test_AtomicData.py::test_non_periodic_edge fails with PyTorch 1.9 and torch_geometric 1.7.
Environment (please complete the following information):
Pytest error message:
`========================================== FAILURES ==========================================
______________________________ test_non_periodic_edge[float32] _______________________________
CH3CHO = (Atoms(symbols='OCHCH3', pbc=False), AtomicData(edge_index=[2, 18], pos=[7, 3], num_nodes=7, atomic_numbers=[7], cell=[3, 3], edge_cell_shift=[18, 3], pbc=[3]))
def test_non_periodic_edge(CH3CHO):
atoms, data = CH3CHO
# check edges
for edge in range(data.num_edges):
real_displacement = (
atoms.positions[data.edge_index[1, edge]]
- atoms.positions[data.edge_index[0, edge]]
)
assert torch.allclose(
data.get_edge_vectors()[edge],
torch.as_tensor(real_displacement, dtype=torch.get_default_dtype()),
)
tests/data/test_AtomicData.py:30:
data = AtomicData(edge_index=[2, 18], pos=[7, 3], num_nodes=7, atomic_numbers=[7], cell=[3, 3], edge_cell_shift=[18, 3], pbc=[3])
def get_edge_vectors(data: Data) -> torch.Tensor:
data = AtomicDataDict.with_edge_vectors(AtomicData.to_AtomicDataDict(data))
E RuntimeError: with_edge_vectors() Expected a value of type 'Dict[str, Tensor]' for argument 'data' but instead found type 'dict'.
E Position: 0
E Value: {'atomic_numbers': tensor([8, 6, 1, 6, 1, 1, 1]), 'num_nodes': 7, 'pos': tensor([[ 1.2181, 0.3612, 0.0000],
E [ 0.0000, 0.4641, 0.0000],
E [-0.4772, 1.4653, 0.0000],
E [-0.9481, -0.7001, 0.0000],
E [-0.3859, -1.6342, 0.0000],
E [-1.5963, -0.6525, 0.8809],
E [-1.5963, -0.6525, -0.8809]]), 'edge_index': tensor([[0, 1, 1, 1, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6],
E [1, 3, 0, 2, 1, 5, 6, 4, 1, 3, 5, 6, 6, 4, 3, 5, 3, 4]]), 'edge_cell_shift': tensor([[0., 0., 0.],
E [0., 0., 0.],
E [0., 0., 0.],
E [0., 0., 0.],
E [0., 0., 0.],
E [0., 0., 0.],
E [0., 0., 0.],
E [0., 0., 0.],
E [0., 0., 0.],
E [0., 0., 0.],
E [0., 0., 0.],
E [0., 0., 0.],
E [0., 0., 0.],
E [0., 0., 0.],
E [0., 0., 0.],
E [0., 0., 0.],
E [0., 0., 0.],
E [0., 0., 0.]]), 'pbc': tensor([False, False, False]), 'cell': tensor([[0., 0., 0.],
E [0., 0., 0.],
E [0., 0., 0.]])}
E Declaration: with_edge_vectors(Dict(str, Tensor) data, bool with_lengths=True) -> (Dict(str, Tensor))
E Cast error details: Unable to cast Python instance to C++ type (compile in debug mode for details)
nequip/data/AtomicData.py:253: RuntimeError
`
Describe the bug
Using an ASE dataset for configs/minimal.yaml instead of npz results in the following:
Loaded data: Batch(atomic_numbers=[21000], batch=[21000], cell=[1000, 3, 3], edge_cell_shift=[220186, 3], edge_index=[2, 220186], pbc=[1000, 3], pos=[21000, 3], ptr=[1001])
Successfully loaded the data set of type ASEDataset(1000)...
Traceback (most recent call last):
File "/home/kkly2/anaconda3/envs/nequip/bin/nequip-train", line 8, in <module>
sys.exit(main())
File "/home/kkly2/anaconda3/envs/nequip/lib/python3.8/site-packages/nequip/scripts/train.py", line 40, in main
fresh_start(parse_command_line(args))
File "/home/kkly2/anaconda3/envs/nequip/lib/python3.8/site-packages/nequip/scripts/train.py", line 148, in fresh_start
stats = trainer.dataset_train.statistics(
File "/home/kkly2/anaconda3/envs/nequip/lib/python3.8/site-packages/nequip/data/dataset.py", line 328, in statistics
elif len(arr) == self.data.num_graphs:
TypeError: object of type 'NoneType' has no len()
To Reproduce
I took configs/minimal.yaml and modified only the data part:
$ diff minimal.yaml ~/nequip/configs/minimal.yaml
15,16c15,16
< dataset: ase
< dataset_file_name: ../MD17/aspirin_ccsd-train.xyz
---
> dataset: aspirin
> dataset_file_name: benchmark_data/aspirin_ccsd-train.npz
where aspirin_ccsd-train.xyz is from here (I assume the .npz is the same dataset?).
Expected behavior
As is, configs/minimal.yaml works for me without any errors, so I expected the same here when using ASE.
Environment (please complete the following information):
The *.csv files contain a blank space before each key.
For example, reading them in with pandas requires adding this blank to the column name, which can be confusing for people not using wandb.
import pandas as pd
df = pd.read_csv("metrics_epoch.csv")
df["training_loss"].plot() # -> raises KeyError
df[" training_loss"].plot() # works
Would it be possible to change the header formatting found in the following lines to not contain the leading blank space?
nequip/nequip/train/trainer.py
Line 1071 in eb6f9bc
nequip/nequip/train/trainer.py
Line 1079 in eb6f9bc
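Until the header format changes, the leading blanks can be dropped at read time. The sketch below uses the standard-library `csv` module with `skipinitialspace=True`; pandas' `read_csv` accepts an argument of the same name. The file contents here are made up for illustration, and only the `training_loss` column name is taken from the report.

```python
# Sketch: ignore the blank that follows each delimiter when reading the CSV.
import csv
import io

# Hypothetical file contents with a space after every comma.
raw = (
    "epoch, training_loss, validation_loss\n"
    "1, 0.52, 0.61\n"
    "2, 0.31, 0.44\n"
)

reader = csv.DictReader(io.StringIO(raw), skipinitialspace=True)
rows = list(reader)

# Keys no longer carry the leading blank.
print(list(rows[0].keys()))
print([float(row["training_loss"]) for row in rows])
```

This only strips whitespace directly after the delimiter, so values and headers that legitimately contain inner spaces are left untouched.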
The ReduceLROnPlateau option actually looks for the loss to increase, not plateau. As long as it doesn't increase, the learning rate doesn't change.
In training my model I never see the loss increase. It just keeps decreasing by tinier and tinier amounts. Could we add a margin to the test, so for example I could tell it to reduce the learning rate any time the loss decreases by less than 2%?
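For context, PyTorch's `torch.optim.lr_scheduler.ReduceLROnPlateau` has `threshold` and `threshold_mode` arguments: in the relative mode, an epoch only counts as an improvement if the loss drops below `best * (1 - threshold)`. The pure-Python sketch below mimics that rule to show what a 2% margin would do. The class name is hypothetical, and whether NequIP forwards an option such as `lr_scheduler_threshold` to the scheduler is an assumption based on the `lr_scheduler_patience` and `lr_scheduler_factor` options, not something confirmed here.

```python
# Sketch: plateau detection with a relative improvement margin.


class RelativePlateauDetector:
    """Reduce the learning rate when the loss stops improving by a margin."""

    def __init__(self, lr, factor, patience, threshold):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.threshold = threshold
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, loss):
        # An epoch is "good" only if it beats the best loss by the margin.
        if loss < self.best * (1.0 - self.threshold):
            self.best = loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        # Reduce once the number of bad epochs exceeds the patience.
        if self.bad_epochs > self.patience:
            self.lr *= self.factor
            self.bad_epochs = 0
        return self.lr


detector = RelativePlateauDetector(lr=0.005, factor=0.5, patience=2, threshold=0.02)
# The loss keeps decreasing, but by less than 2% relative to the best value.
learning_rates = [detector.step(loss) for loss in [1.0, 0.995, 0.990, 0.985]]
print(learning_rates)
```

With a margin of 0.02, three consecutive epochs that improve on the best loss by less than 2% are enough to exceed a patience of 2, even though the loss never increases.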
Is your feature request related to a problem? Please describe.
In order to run NPT simulations, the stress of the system is needed, so I wonder if it is possible to add an option to calculate it. Calculating the stress requires the derivative of the energy with respect to the cell. In evaluation mode, however, the model does not store the gradients, so this value cannot be computed. In training mode everything is rescaled, and I think using training mode for inference is not what it is meant for. Adding a cell and stress key in GradientOutput causes an error that there is no key 'cell' in irreps_in[wrt] in line 62 of _grad_output.py. As I figure the cell should not be included in irreps_in, this option seems bad too.
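For reference, the derivative mentioned above is commonly written in terms of a small symmetric strain applied to both the atomic positions and the lattice vectors. The following is a sketch of that standard definition, with the sign convention in which the stress is the strain derivative of the energy per unit volume; it is not a description of how NequIP implements or would implement it.

```latex
% Apply a small symmetric strain \varepsilon to positions r_i and lattice vectors a_k:
\mathbf{r}_i \to (\mathbf{1} + \boldsymbol{\varepsilon})\,\mathbf{r}_i, \qquad
\mathbf{a}_k \to (\mathbf{1} + \boldsymbol{\varepsilon})\,\mathbf{a}_k

% Stress as the strain derivative of the energy per cell volume V:
\sigma_{\alpha\beta} = \frac{1}{V}
  \left.\frac{\partial E}{\partial \varepsilon_{\alpha\beta}}\right|_{\boldsymbol{\varepsilon} = 0}

% If E depends on the geometry only through edge vectors r_{ij}
% (which transform the same way under the strain), the chain rule gives:
\frac{\partial E}{\partial \varepsilon_{\alpha\beta}}
  = \sum_{(i,j)} \frac{\partial E}{\partial r_{ij,\alpha}}\, r_{ij,\beta}
```

The last line suggests one way around the problem described above: the strain can be treated as an extra input with respect to which the energy is differentiated, so the cell itself does not need to appear among the irreps.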
Describe the solution you'd like
Method on the final_model (RescaleOutput) or GradientOutput to calculate stress, if a stress parameter in the config file is True
Additional context
In schnetpack this is implemented in https://github.com/atomistic-machine-learning/schnetpack/blob/master/src/schnetpack/atomistic/output_modules.py and https://github.com/atomistic-machine-learning/schnetpack/blob//src/schnetpack/atomistic/model.py
Is your feature request related to a problem? Please describe.
The only way right now seems to be to download the repo and install nequip locally using pip. However, mixing conda and pip in a conda virtual environment risks messing up packages installed by those two package managers.
Describe the solution you'd like
It would be awesome if you guys can add support for conda installation.
Describe alternatives you've considered
Alternatively, hosting on pypi also resolves the issue since I can simply use the conda skeleton command to build and install using conda.
Additional context
Describe the bug
I am trying to train a model on a large dataset with over 700,000 conformations for a diverse set of molecules. I created a dataset in extxyz format as described at #89. The file is about 1.5 GB. When I run nequip-train, it displays the message, "Processing...", shows no further output for about 20 minutes, and then exits with the message, "Killed".
I also tried using a subset of only about the first 200 conformations from the file, and that worked. I suspect the problem is caused by running out of memory or some other resource. Is there any supported way of handling large datasets like this?
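A subset like the one mentioned above can be cut from a large file without loading it into memory. The sketch below streams frames lazily, assuming each frame starts with a line holding only the atom count, followed by one comment line and then one line per atom. The function names are hypothetical and nothing here is part of NequIP.

```python
# Sketch: lazily copy the first N frames of an (ext)xyz file.
import io
from itertools import islice


def iter_xyz_frames(lines):
    """Yield each frame as a list of lines: count, comment, then atoms."""
    line_iter = iter(lines)
    for header in line_iter:
        if not header.strip():
            continue  # tolerate blank lines between frames
        n_atoms = int(header)
        frame = [header] + list(islice(line_iter, n_atoms + 1))
        if len(frame) != n_atoms + 2:
            raise ValueError("file ends in the middle of a frame")
        yield frame


def write_first_frames(src, dst, n_frames):
    """Copy at most n_frames frames from src to dst; return how many."""
    count = 0
    for frame in islice(iter_xyz_frames(src), n_frames):
        dst.writelines(frame)
        count += 1
    return count


# Tiny made-up example with three frames.
sample = (
    "2\ncomment a\nH 0 0 0\nH 0 0 0.74\n"
    "1\ncomment b\nHe 0 0 0\n"
    "2\ncomment c\nH 0 0 0\nH 0 0 0.8\n"
)
subset = io.StringIO()
n_written = write_first_frames(io.StringIO(sample), subset, 2)
print(n_written)
print(subset.getvalue(), end="")
```

With real files, `src` and `dst` would be file objects opened with `open(...)`. Growing the subset step by step is one way to find out at which size the preprocessing runs out of memory.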
To Reproduce
The dataset is much too large to attach, but if it would be helpful I can find a different way of sending it to you.
Environment (please complete the following information):
(python --version) 3.9
(import nequip; nequip.__version__) 0.5.4
(import e3nn; e3nn.__version__) 0.4.4
(import torch; torch.__version__) 1.10.0
(nvcc --version)
(import torch; torch.version.cuda)
Describe the bug
NequIP models trained with float64 precision always segfault at prediction time, in both ASE and LAMMPS, on both GPU and CPU.
To Reproduce
calc = NequIP(...)
atoms = bulk(...)
calc.get_potential_energy(atoms)
Terminal output
/home/mphuthi_andrew_cmu_edu/miniconda3/envs/nequip/lib/python3.9/site-packages/nequip/utils/_global_options.py:58: UserWarning: Setting the GLOBAL value for jit fusion strategy to `[('DYNAMIC', 3)]` which is different than the previous value of `[('STATIC', 2), ('DYNAMIC', 10)]`
warnings.warn(
/home/mphuthi_andrew_cmu_edu/miniconda3/envs/nequip/lib/python3.9/site-packages/nequip/ase/nequip_calculator.py:73: UserWarning: Trying to use chemical symbols as NequIP type names; this may not be correct for your model! To avoid this warning, please provide `species_to_type_name` explicitly.
warnings.warn(
Segmentation fault
Expected behavior
No segfault
Environment (please complete the following information):
Additional context
I have never been able to get float64 models to work even with other build recipes in the past and with Allegro but they train fine.
Is your feature request related to a problem? Please describe.
I would like to run long MD simulations using OpenMM. With https://github.com/openmm/openmm-torch it is possible to add forces to a model using a TorchScript module. However, the input has to be the positions and box vectors.
To this end, the neighborlist calculations should be included in the Torchscript module.
Describe the solution you'd like
A TorchScript model that takes the positions and box vectors as input and returns the energy. To this end, a neighbor list would have to be computed on the GPU, as far as I know.
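To make the request concrete, here is a minimal sketch of the computation that would have to move inside the model: building a neighbor list from positions and box vectors alone. It is plain Python rather than TorchScript, and it assumes an orthorhombic, fully periodic box and a cutoff no larger than half the shortest box length; none of these assumptions come from the original post, and the function name is hypothetical. A TorchScript version would express the same arithmetic with tensor operations.

```python
import math
from typing import List, Tuple


def neighbor_list(
    positions: List[Tuple[float, float, float]],
    box_lengths: Tuple[float, float, float],
    r_max: float,
) -> List[Tuple[int, int]]:
    """Brute-force O(N^2) neighbor list with the minimum-image convention.

    Assumptions (not from the original post): an orthorhombic box that is
    periodic in all three directions, and r_max <= half the shortest box
    length so each pair has at most one image within the cutoff.
    """
    if r_max > 0.5 * min(box_lengths):
        raise ValueError("r_max must not exceed half the shortest box length")
    edges = []
    for i, pi in enumerate(positions):
        for j, pj in enumerate(positions):
            if i == j:
                continue
            dist_sq = 0.0
            for a in range(3):
                d = pj[a] - pi[a]
                # Wrap the displacement into [-L/2, L/2].
                d -= box_lengths[a] * round(d / box_lengths[a])
                dist_sq += d * d
            if math.sqrt(dist_sq) <= r_max:
                edges.append((i, j))
    return edges


positions = [(0.5, 5.0, 5.0), (9.5, 5.0, 5.0), (5.0, 5.0, 5.0)]
edges = neighbor_list(positions, box_lengths=(10.0, 10.0, 10.0), r_max=2.0)
```

In this toy system the first two atoms are neighbors only through the periodic boundary, which is the case a positions-only interface has to handle itself.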
Is it possible to train models in single precision, but deploy them in double precision? I'm trying to use my single-precision-trained models to compute some finite differences, and this is typically only possible when the output of the models uses double precision.
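The precision concern can be illustrated without any model at all. The sketch below uses sin(x) as a stand-in for a model energy (an assumption made purely for illustration) and compares a central finite difference when the output is kept in double precision against the same output rounded to single precision.

```python
import math
import struct


def to_float32(x: float) -> float:
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def central_difference(f, x: float, h: float) -> float:
    """Second-order central finite difference of f at x."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


def energy64(x: float) -> float:
    return math.sin(x)  # stand-in "model" with double-precision output


def energy32(x: float) -> float:
    return to_float32(math.sin(x))  # same model, output stored in float32


x0, h = 1.0, 1e-4
exact = math.cos(x0)
err64 = abs(central_difference(energy64, x0, h) - exact)
err32 = abs(central_difference(energy32, x0, h) - exact)
print(f"float64 output error: {err64:.2e}")
print(f"float32 output error: {err32:.2e}")
```

The rounding error of the single-precision output is divided by the small step 2h, so it dominates the finite-difference error; this is why the output precision matters here regardless of the precision used during training.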
Describe the bug
To Reproduce
After compiling nequip 0.5.5 from source, when trying to use it we see this error:
$ nequip-train config.yaml
Traceback (most recent call last):
File "/ccc/work/cont003/gen7069/couderfx/nequip/bin/nequip-train", line 33, in <module>
sys.exit(load_entry_point('nequip==0.5.5', 'console_scripts', 'nequip-train')())
File "/ccc/work/cont003/gen7069/couderfx/nequip/lib/python3.8/site-packages/nequip-0.5.5-py3.8.egg/nequip/scripts/train.py", line 72, in main
File "/ccc/work/cont003/gen7069/couderfx/nequip/lib/python3.8/site-packages/nequip-0.5.5-py3.8.egg/nequip/scripts/train.py", line 120, in fresh_start
File "/ccc/work/cont003/gen7069/couderfx/nequip/lib/python3.8/site-packages/nequip-0.5.5-py3.8.egg/nequip/utils/versions.py", line 48, in check_code_version
File "/ccc/work/cont003/gen7069/couderfx/nequip/lib/python3.8/site-packages/nequip-0.5.5-py3.8.egg/nequip/utils/versions.py", line 42, in get_current_code_versions
File "/ccc/work/cont003/gen7069/couderfx/nequip/lib/python3.8/site-packages/nequip-0.5.5-py3.8.egg/nequip/utils/versions.py", line 42, in <dictcomp>
File "/ccc/work/cont003/gen7069/couderfx/nequip/lib/python3.8/site-packages/nequip-0.5.5-py3.8.egg/nequip/utils/git.py", line 13, in get_commit
File "/ccc/products2/python3-3.8.10/Rhel_8__x86_64/gcc--8.3.0__openmpi--4.0.1/cuda/lib/python3.8/subprocess.py", line 493, in run
with Popen(*popenargs, **kwargs) as process:
File "/ccc/products2/python3-3.8.10/Rhel_8__x86_64/gcc--8.3.0__openmpi--4.0.1/cuda/lib/python3.8/subprocess.py", line 858, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File "/ccc/products2/python3-3.8.10/Rhel_8__x86_64/gcc--8.3.0__openmpi--4.0.1/cuda/lib/python3.8/subprocess.py", line 1704, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
NotADirectoryError: [Errno 20] Not a directory: '/ccc/work/cont003/gen7069/couderfx/nequip/lib/python3.8/site-packages/nequip-0.5.5-py3.8.egg/nequip/..'
This does not appear to depend on the details of the config.yaml file; any example from nequip itself will reproduce the issue.
The error is that get_commit calls git with a cwd set by appending /.. to $VENV/lib/python3.8/site-packages/nequip-0.5.5-py3.8.egg/nequip (where VENV is the virtualenv where nequip is installed). But that fails, because nequip-0.5.5-py3.8.egg is not actually a directory, it's a ZIP file, so the OS does not recognise $VENV/lib/python3.8/site-packages/nequip-0.5.5-py3.8.egg/nequip/.. as a valid directory (and rightly so).
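A defensive version of such a lookup could check that the working directory really exists before invoking git. The following is a sketch under stated assumptions, not nequip's actual get_commit: the helper name, the git arguments, and the choice to return None are all hypothetical.

```python
import subprocess
from pathlib import Path
from typing import Optional


def get_commit_or_none(package_dir: Path) -> Optional[str]:
    """Return the git commit of the checkout containing package_dir, if any.

    Hypothetical helper, not nequip's actual get_commit: it returns None
    instead of raising when the parent path is not a real directory (for
    example inside a zipped .egg) or when git is unavailable.
    """
    cwd = package_dir.parent
    if not cwd.is_dir():
        return None
    try:
        result = subprocess.run(
            ["git", "rev-parse", "--short", "HEAD"],
            cwd=cwd,
            capture_output=True,
            text=True,
            timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    if result.returncode != 0:
        return None
    return result.stdout.strip()


# A path inside a zip-style egg is not a directory on disk.
commit = get_commit_or_none(Path("/nonexistent/nequip-0.5.5-py3.8.egg/nequip"))
```

Returning None keeps a missing commit hash from aborting training, at the cost of losing the version information in that case.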
Environment (please complete the following information):
Additional context
nequip was installed in the virtualenv from source by running: python setup.py install
I have trained a nequip model M with a large dataset A. What can I do to make M also applicable to a small but different dataset B (other than training a new model on A+B)?
nequip/nequip/train/trainer.py
Line 1180 in 41d6b2d
This line should have n_val instead of n_train in the error text
Environment I used is
During the training, I tried to use a training set with multiple cell sizes (for example, some structures with 120 atoms and some with 60 atoms). The training then ended with the errors below.
instantiate NpzDataset
optional_args : key_mapping
optional_args : npz_fixed_field_keys
optional_args : root
optional_args : extra_fixed_fields <- dataset_extra_fixed_fields
optional_args : file_name <- dataset_file_name
...NpzDataset_param = dict(
... optional_args = {'key_mapping': {'z': 'atomic_numbers', 'E': 'total_energy', 'F': 'forces', 'R': 'pos'}, 'include_keys': [], 'npz_fixed_field_keys': ['atomic_numbers'], 'file_name': './train_set.npz', 'url': None, 'force_fixed_keys': [], 'extra_fixed_fields': {'r_max': 4.0}, 'include_frames': None, 'root': 'results/GeSe2'},
... positional_args = {'type_mapper': <nequip.data.transforms.TypeMapper object at 0x2b9f505d7490>})
Traceback (most recent call last):
File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/utils/auto_init.py", line 232, in instantiate
instance = builder(**positional_args, **final_optional_args)
File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/data/dataset.py", line 681, in __init__
super().__init__(
File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/data/dataset.py", line 123, in __init__
super().__init__(root=root, transform=type_mapper)
File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/utils/torch_geometric/dataset.py", line 90, in __init__
self._process()
File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/utils/torch_geometric/dataset.py", line 175, in _process
self.process()
File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/data/dataset.py", line 269, in process
data_list = [
File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/data/dataset.py", line 270, in <listcomp>
constructor(**{**{f: v[i] for f, v in fields.items()}, **fixed_fields})
File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/data/AtomicData.py", line 326, in from_points
return cls(edge_index=edge_index, pos=torch.as_tensor(pos), **kwargs)
File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/data/AtomicData.py", line 221, in __init__
_process_dict(kwargs)
File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/data/AtomicData.py", line 163, in _process_dict
raise ValueError(
ValueError: atomic_numbers is a node field but has the wrong dimension torch.Size([72, 1])
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/gshs12051/anaconda3/envs/pytorch/bin/nequip-train", line 8, in <module>
sys.exit(main())
File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/scripts/train.py", line 74, in main
trainer = fresh_start(config)
File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/scripts/train.py", line 177, in fresh_start
dataset = dataset_from_config(config, prefix="dataset")
File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/data/_build.py", line 78, in dataset_from_config
instance, _ = instantiate(
File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/utils/auto_init.py", line 234, in instantiate
raise RuntimeError(
RuntimeError: Failed to build object with prefix `dataset` using builder `NpzDataset`
I'm trying to use nequip-evaluate with a different dataset read through a yaml file and with a deployed potential using the command:
nequip-evaluate --model deployed_it9.pth --dataset-config predict.yaml --output bulk_nn.xyz --batch-size 1
The command runs for a few configurations of data and then gives an error:
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/py/bin/nequip-evaluate", line 8, in <module>
sys.exit(main())
File "/home/ubuntu/anaconda3/envs/py/lib/python3.9/site-packages/nequip/scripts/evaluate.py", line 372, in main
out = model(AtomicData.to_AtomicDataDict(batch))
File "/home/ubuntu/anaconda3/envs/py/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: Unsupported value kind: Tensor
Is your feature request related to a problem? Please describe.
I'm trying to train on the VASP output of two different systems (same solvent, different solutes).
I combined the two OUTCAR files with the cat command, tried to train on the data, and got the following error.
Traceback (most recent call last):
File "/data/home/reddy/anaconda3/lib/python3.8/site-packages/nequip/utils/auto_init.py", line 232, in instantiate
instance = builder(**positional_args, **final_optional_args)
File "/data/home/reddy/anaconda3/lib/python3.8/site-packages/nequip/data/dataset.py", line 796, in __init__
super().__init__(
File "/data/home/reddy/anaconda3/lib/python3.8/site-packages/nequip/data/dataset.py", line 123, in __init__
super().__init__(root=root, transform=type_mapper)
File "/data/home/reddy/anaconda3/lib/python3.8/site-packages/nequip/utils/torch_geometric/dataset.py", line 90, in __init__
self._process()
File "/data/home/reddy/anaconda3/lib/python3.8/site-packages/nequip/utils/torch_geometric/dataset.py", line 175, in _process
self.process()
File "/data/home/reddy/anaconda3/lib/python3.8/site-packages/nequip/data/dataset.py", line 206, in process
data = self.get_data()
File "/data/home/reddy/anaconda3/lib/python3.8/site-packages/nequip/data/dataset.py", line 857, in get_data
atoms_list = self.get_atoms()
File "/data/home/reddy/anaconda3/lib/python3.8/site-packages/nequip/data/dataset.py", line 852, in get_atoms
return aseread(self.raw_dir + "/" + self.raw_file_names[0], **self.ase_args)
File "/data/home/reddy/anaconda3/lib/python3.8/site-packages/ase/io/formats.py", line 733, in read
return list(_iread(filename, index, format, io, parallel=parallel,
File "/data/home/reddy/anaconda3/lib/python3.8/site-packages/ase/parallel.py", line 275, in new_generator
for result in generator(*args, **kwargs):
File "/data/home/reddy/anaconda3/lib/python3.8/site-packages/ase/io/formats.py", line 803, in _iread
for dct in io.read(fd, *args, **kwargs):
File "/data/home/reddy/anaconda3/lib/python3.8/site-packages/ase/io/formats.py", line 559, in wrap_read_function
for atoms in read(filename, index, **kwargs):
File "/data/home/reddy/anaconda3/lib/python3.8/site-packages/ase/utils/__init__.py", line 486, in iofunc
obj = func(fd, *args, **kwargs)
File "/data/home/reddy/anaconda3/lib/python3.8/site-packages/ase/io/vasp.py", line 270, in read_vasp_out
return list(g)
File "/data/home/reddy/anaconda3/lib/python3.8/site-packages/ase/io/utils.py", line 246, in __call__
yield chunk.build(**kwargs)
File "/data/home/reddy/anaconda3/lib/python3.8/site-packages/ase/io/vasp_parsers/vasp_outcar_parsers.py", line 710, in build
return self.parser.build(self.lines)
File "/data/home/reddy/anaconda3/lib/python3.8/site-packages/ase/io/vasp_parsers/vasp_outcar_parsers.py", line 593, in build
results = self.parse(lines)
File "/data/home/reddy/anaconda3/lib/python3.8/site-packages/ase/io/vasp_parsers/vasp_outcar_parsers.py", line 527, in parse
prop = parser.parse(cursor, lines)
File "/data/home/reddy/anaconda3/lib/python3.8/site-packages/ase/io/vasp_parsers/vasp_outcar_parsers.py", line 436, in parse
assert 'spin component' in lines[cursor]
AssertionError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/data/home/reddy/anaconda3/bin/nequip-train", line 8, in <module>
sys.exit(main())
File "/data/home/reddy/anaconda3/lib/python3.8/site-packages/nequip/scripts/train.py", line 65, in main
trainer = fresh_start(config)
File "/data/home/reddy/anaconda3/lib/python3.8/site-packages/nequip/scripts/train.py", line 163, in fresh_start
dataset = dataset_from_config(config, prefix="dataset")
File "/data/home/reddy/anaconda3/lib/python3.8/site-packages/nequip/data/_build.py", line 78, in dataset_from_config
instance, _ = instantiate(
File "/data/home/reddy/anaconda3/lib/python3.8/site-packages/nequip/utils/auto_init.py", line 234, in instantiate
raise RuntimeError(
RuntimeError: Failed to build object with prefix `dataset` using builder `ASEDataset`
Describe the solution you'd like
Can we combine the OUTCAR files of different systems and train with nequip?
I have seen in the following thread it is possible with extxyz format.
#89
Should I convert the OUTCAR files to extxyz format? If so, how?
Thank you.
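One possible route, sketched here with ASE's generic read and write functions, is to read each OUTCAR on its own and write all frames into a single extxyz file instead of concatenating the raw OUTCAR files. The helper name and the example file names are hypothetical, and this has not been checked against the specific files from this issue.

```python
from typing import List, Sequence


def outcars_to_extxyz(outcar_paths: Sequence[str], output_path: str) -> int:
    """Read each OUTCAR separately and write all frames to one extxyz file.

    Returns the number of frames written. The file names passed in are
    the caller's own; nothing here is specific to nequip.
    """
    # Imported here so the function can be defined without ASE installed.
    from ase.io import read, write

    frames: List = []
    for path in outcar_paths:
        # index=":" asks ASE for every ionic step, not just the last one.
        frames.extend(read(path, index=":", format="vasp-out"))
    write(output_path, frames, format="extxyz")
    return len(frames)
```

A call might look like outcars_to_extxyz(["solute_a/OUTCAR", "solute_b/OUTCAR"], "combined.extxyz"), where both input paths are placeholders.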
Is your feature request related to a problem? Please describe.
Using LAMMPS can be a bit of a pain.
Describe the solution you'd like
OpenMM is a really easy-to-use, Pythonic MD code that also allows a user-defined set of forces and runs ridiculously fast on one GPU. It seems they are also starting to incorporate some ML methods. It would be amazing if one could run NequIP MD with OpenMM.
Describe alternatives you've considered
ASE is an option, but wouldn't get the speed of a GPU-accelerated MD code. OpenMM will JIT user-defined forces to the GPU.
Hi all,
I'm relatively new to the code here. Is there a way to use Weights and Biases to sweep hyperparameters (like the batch size, etc.)? I've been using the following code:
sweep_id = wandb.sweep(sweep_config, project="sweep")
trainer = TrainerWandB(model=model,**dict(minimal_config))
trainer.save()
trainer.set_dataset(dataset)
wandb.agent(sweep_id, trainer.train(), count=5)
and my sweep_config looks like this:
{'method': 'random',
'metric': {'goal': 'minimize', 'name': 'validation_e'},
'parameters': {'batch_size': {'distribution': 'q_log_uniform_values',
'max': 256,
'min': 32,
'q': 8},
'learning_rate': {'distribution': 'uniform',
'max': 0.1,
'min': 0}}}
The first iteration runs fine, but the subsequent runs fail
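One possible cause, not confirmed in this thread, is that the snippet passes trainer.train() (the result of a single call) where wandb.agent takes a function argument that it can call once per trial. The difference is sketched below with a stand-in agent; run_agent and train_once are hypothetical names and no wandb code is involved.

```python
from typing import Callable, List


def run_agent(function: Callable[[], None], count: int) -> None:
    """Stand-in for a sweep agent: calls `function` once per trial."""
    for _ in range(count):
        function()


runs: List[int] = []


def train_once() -> None:
    # In a real sweep this is where a fresh trainer would be built from
    # the hyperparameters chosen for the current trial.
    runs.append(len(runs))


# Passing the function object lets the agent start every trial itself.
run_agent(train_once, count=5)

# By contrast, `run_agent(train_once(), count=5)` would call train_once
# a single time up front and hand its return value (None) to the agent.
```

Under this reading, each trial would need its own function call that builds a new trainer, so that the batch size and learning rate sampled for that trial are actually applied.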
In the Installation section in https://github.com/mir-group/nequip/blob/main/README.md,
there is a link to the installation instructions for pytorch-geometric (https://pytorch-geometric.readthedocs.io/en/latest/notes/installation.html), which points to the instructions for the current version and not for 1.7.2, which is required by nequip.
I suggest you consider changing the link to the instructions for the version 1.7.2 at https://pytorch-geometric.readthedocs.io/en/1.7.2/notes/installation.html
Describe the bug
I just upgraded from nequip 0.5.3 to 0.5.5 and found that I was getting an error in torch.einsum in e3nn.o3.Linear.
Here is the error message
Traceback (most recent call last):
File "/Users/emil/anaconda3/envs/nequip/lib/python3.9/site-packages/nequip/utils/auto_init.py", line 241, in instantiate
instance = builder(**positional_args, **final_optional_args)
File "/Users/emil/anaconda3/envs/nequip/lib/python3.9/site-packages/nequip/nn/_atomwise.py", line 50, in __init__
self.linear = Linear(
File "/Users/emil/anaconda3/envs/nequip/lib/python3.9/site-packages/e3nn/o3/_linear.py", line 178, in __init__
graphmod, self.weight_numel, self.bias_numel = _codegen_linear(
File "/Users/emil/anaconda3/envs/nequip/lib/python3.9/site-packages/e3nn/o3/_linear.py", line 416, in _codegen_linear
ein_out = torch.einsum(f"{z}uw,zui->zwi", w, x_list[ins.i_in])
File "/Users/emil/anaconda3/envs/nequip/lib/python3.9/site-packages/torch/functional.py", line 351, in einsum
return handle_torch_function(einsum, operands, equation, *operands)
File "/Users/emil/anaconda3/envs/nequip/lib/python3.9/site-packages/torch/overrides.py", line 1498, in handle_torch_function
result = torch_func_method(public_api, types, args, kwargs)
File "/Users/emil/anaconda3/envs/nequip/lib/python3.9/site-packages/torch/fx/proxy.py", line 309, in __torch_function__
raise RuntimeError(f'Found multiple different tracers {list(tracers.keys())} while '
RuntimeError: Found multiple different tracers [<torch.fx.proxy.GraphAppendingTracer object at 0x7fb4de421a60>, <torch.fx.proxy.GraphAppendingTracer object at 0x7fb4de421850>] while trying to trace operations <function einsum at 0x7fb4da671040>
To Reproduce
initializing a model should throw the error
model = nequip.model.model_from_config
Expected behavior
No error
Describe the bug
I have a trained nequip model and am trying to evaluate it, but I am getting a value quite different from what I expected. After some debugging, I realized that I wasn't setting the model to evaluation mode: if the model is in training mode, I get a different value than if it is in evaluation mode.
To Reproduce
with a trained nequip model,
model.train()(data) != model.eval()(data)
Expected behavior
I would expect the two to give the same result
Environment (please complete the following information):
Describe the bug
When generating a dataset (in my case from npz), I get the following error message
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\maxwe\\AppData\\Local\\Temp\\tmpahynxb3w'
With the following traceback
Traceback (most recent call last):
File "C:\Users\maxwe\SPIN_materials\nequip\utils\savenload.py", line 39, in _process_moves
shutil.move(from_name, tmp_path)
File "C:\Users\maxwe\anaconda3\envs\mat_env\lib\shutil.py", line 834, in move
os.unlink(src)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\maxwe\\AppData\\Local\\Temp\\tmpahynxb3w'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\maxwe\SPIN_materials\tests.py", line 31, in <module>
test = NpzDataset(root='./', file_name='./tempout.npz', include_keys=['sid', 'cell', 'pbc', 'r_max', 'pos', 'total_energy'])
File "C:\Users\maxwe\SPIN_materials\nequip\data\dataset.py", line 701, in __init__
super().__init__(
File "C:\Users\maxwe\SPIN_materials\nequip\data\dataset.py", line 166, in __init__
super().__init__(root=root, type_mapper=type_mapper)
File "C:\Users\maxwe\SPIN_materials\nequip\data\dataset.py", line 50, in __init__
super().__init__(root=root, transform=type_mapper)
File "C:\Users\maxwe\SPIN_materials\nequip\utils\torch_geometric\dataset.py", line 91, in __init__
self._process()
File "C:\Users\maxwe\SPIN_materials\nequip\utils\torch_geometric\dataset.py", line 176, in _process
self.process()
File "C:\Users\maxwe\SPIN_materials\nequip\data\dataset.py", line 306, in process
with atomic_write(self.processed_paths[0], binary=True) as f:
File "C:\Users\maxwe\anaconda3\envs\mat_env\lib\contextlib.py", line 142, in __exit__
next(self.gen)
File "C:\Users\maxwe\SPIN_materials\nequip\utils\savenload.py", line 182, in atomic_write
_submit_move(Path(tp.name), Path(fname), blocking=blocking)
File "C:\Users\maxwe\SPIN_materials\nequip\utils\savenload.py", line 128, in _submit_move
_process_moves([obj])
File "C:\Users\maxwe\SPIN_materials\nequip\utils\savenload.py", line 43, in _process_moves
_delete_files_if_exist([m[1] for m in moves])
File "C:\Users\maxwe\SPIN_materials\nequip\utils\savenload.py", line 25, in _delete_files_if_exist
f.unlink(missing_ok=True)
File "C:\Users\maxwe\anaconda3\envs\mat_env\lib\pathlib.py", line 1204, in unlink
self._accessor.unlink(self)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\maxwe\\AppData\\Local\\Temp\\tmpahynxb3w'
To Reproduce
All I need to reproduce is
from nequip.data.dataset import NpzDataset
test = NpzDataset(root='./', file_name='./temp.npz', include_keys=['sid', 'cell', 'pbc', 'r_max', 'pos', 'total_energy'])
Expected behavior
I expected the dataset to generate (maybe with some key errors, I'm not sure), but at least without an error such as this one.
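WinError 32 indicates that the temporary file was still open somewhere when it was moved or deleted. A common pattern that avoids this on Windows is to close the temporary file before moving it into place. The sketch below uses only the standard library; it is a hypothetical helper, not nequip's atomic_write, and whether an open handle is the actual cause here has not been confirmed.

```python
import os
import tempfile
from pathlib import Path


def write_then_move(target: Path, payload: bytes) -> None:
    """Write payload to a temporary file, close it, then move it into place.

    Hypothetical helper, not nequip's atomic_write. The temporary file is
    created next to the target so the final os.replace stays on one
    filesystem, and it is closed before being moved because Windows
    typically refuses to move or delete a file that is still open.
    """
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(payload)
        # The handle is closed here, so the move is safe on Windows too.
        os.replace(tmp_name, target)
    except BaseException:
        # Clean up the partial file if anything went wrong.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


with tempfile.TemporaryDirectory() as workdir:
    out = Path(workdir) / "processed.bin"
    write_then_move(out, b"dataset bytes")
    written = out.read_bytes()
```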
Environment (please complete the following information):
Describe the bug
Something in the PyTorch script fails to compile when deploying the model with the "cuda" device and a call to "get_potential_energy" is made.
To Reproduce
model = NequIPCalculator.from_deployed_model(args.model, device="cuda" if args.gpu else "cpu")
model_energy = atoms.get_potential_energy()
Expected behavior
Use the GPU to do calculation
Environment (please complete the following information):
Linux
Python 3.9.13
NequIP 0.5.5
e3nn 0.5.0
PyTorch 1.9.1+cu111
Cuda 11.1
Running on A100 GPU
Traceback
File "/home/tgmaxson/mambaforge/envs/meta_learning_a100/lib/python3.9/site-packages/ase/calculators/calculator.py", line 709, in get_potential_energy
energy = self.get_property('energy', atoms)
File "/home/tgmaxson/mambaforge/envs/meta_learning_a100/lib/python3.9/site-packages/ase/calculators/calculator.py", line 737, in get_property
self.calculate(atoms, [name], system_changes)
File "/home/tgmaxson/mambaforge/envs/meta_learning_a100/lib/python3.9/site-packages/nequip/ase/nequip_calculator.py", line 118, in calculate
out = self.model(data)
File "/home/tgmaxson/mambaforge/envs/meta_learning_a100/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: default_program(18): error: expected a ")"
default_program(21): error: expected a ";"
default_program(23): error: expression must have class type
default_program(24): error: expression must have class type
default_program(25): error: expression must have class type
default_program(26): error: "const_self" has already been declared in the current scope
default_program(26): error: expected a ";"
default_program(27): error: expression must have class type
default_program(27): error: expression must have class type
default_program(28): error: "const_self" has already been declared in the current scope
default_program(28): error: expected a ";"
default_program(29): error: expression must have class type
default_program(29): error: expression must have class type
default_program(29): error: expression must have class type
default_program(31): error: expression must have class type
default_program(31): error: expression must have class type
default_program(31): error: expression must have class type
17 errors detected in the compilation of "default_program".
nvrtc compilation failed:
#define NAN __int_as_float(0x7fffffff)
#define POS_INFINITY __int_as_float(0x7f800000)
#define NEG_INFINITY __int_as_float(0xff800000)
template<typename T>
__device__ T maximum(T a, T b) {
return isnan(a) ? a : (a > b ? a : b);
}
template<typename T>
__device__ T minimum(T a, T b) {
return isnan(a) ? a : (a < b ? a : b);
}
extern "C" __global__
void fused_mul_div_sin_div_mul_sub_mul_mul(float* t_, float* t__, float* aten_mul, float* aten_mul_1, float* aten_sub, float* aten_sin, float* aten_div, float* aten_mul_2, float* const_self.model.func.radial_basis.basis._inv_std, float* const_self.model.func.radial_basis.basis._mean, float* const_self.model.func.radial_basis.basis.basis.bessel_weights) {
{
if (512 * blockIdx.x + threadIdx.x<864 ? 1 : 0) {
float const_self.model.func.radial_basis.basis.basis.bessel_weights_1 = __ldg(const_self.model.func.radial_basis.basis.basis.bessel_weights + (512 * blockIdx.x + threadIdx.x) % 12);
float t___1 = __ldg(t__ + (512 * blockIdx.x + threadIdx.x) / 12);
aten_mul_2[512 * blockIdx.x + threadIdx.x] = const_self.model.func.radial_basis.basis.basis.bessel_weights_1 * t___1;
aten_div[512 * blockIdx.x + threadIdx.x] = (float)((double)(const_self.model.func.radial_basis.basis.basis.bessel_weights_1 * t___1) / 10.0);
aten_sin[512 * blockIdx.x + threadIdx.x] = sinf((float)((double)(const_self.model.func.radial_basis.basis.basis.bessel_weights_1 * t___1) / 10.0));
float const_self.model.func.radial_basis.basis._mean_1 = __ldg(const_self.model.func.radial_basis.basis._mean + (512 * blockIdx.x + threadIdx.x) % 12);
aten_sub[512 * blockIdx.x + threadIdx.x] = (float)((double)((sinf((float)((double)(const_self.model.func.radial_basis.basis.basis.bessel_weights_1 * t___1) / 10.0))) / t___1) * 0.2) - const_self.model.func.radial_basis.basis._mean_1;
float const_self.model.func.radial_basis.basis._inv_std_1 = __ldg(const_self.model.func.radial_basis.basis._inv_std + (512 * blockIdx.x + threadIdx.x) % 12);
aten_mul_1[512 * blockIdx.x + threadIdx.x] = ((float)((double)((sinf((float)((double)(const_self.model.func.radial_basis.basis.basis.bessel_weights_1 * t___1) / 10.0))) / t___1) * 0.2) - const_self.model.func.radial_basis.basis._mean_1) * const_self.model.func.radial_basis.basis._inv_std_1;
float v = __ldg(t_ + (512 * blockIdx.x + threadIdx.x) / 12);
aten_mul[512 * blockIdx.x + threadIdx.x] = (((float)((double)((sinf((float)((double)(const_self.model.func.radial_basis.basis.basis.bessel_weights_1 * t___1) / 10.0))) / t___1) * 0.2) - const_self.model.func.radial_basis.basis._mean_1) * const_self.model.func.radial_basis.basis._inv_std_1) * v;
}
}
}
Describe the bug
While training a nequip neural network on molecules of 304 atoms, the training starts normally, but after a while (around 840 epochs) I get the following error.
RecursionError: maximum recursion depth exceeded while calling a Python object
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/kyukon/scratch/gent/vo/000/gvo00003/vsc43723/my_nequip/my_neq/lib/python3.8/site-packages/torch/fx/graph_module.py", line 505, in wrapped_call
return cls_call(self, *args, **kwargs)
File "/kyukon/scratch/gent/vo/000/gvo00003/vsc43723/my_nequip/my_neq/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "<eval_with_key_4055>", line 10, in forward
new_zeros = x.new_zeros(add); add = None
SystemError: <method 'new_zeros' of 'torch._C._TensorBase' objects> returned a result with an error set
Call using an FX-traced Module, line 10 of the traced Module's generated forward function:
add = getitem_1 + (32,); getitem_1 = None
new_zeros = x.new_zeros(add); add = None
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
getattr_3 = x.shape
getitem_2 = getattr_3[slice(None, -1, None)]; getattr_3 = None
RecursionError: maximum recursion depth exceeded while calling a Python object
To Reproduce
I am still trying to create a minimal working example, this error only happens after training for 10 hours at the moment.
Environment (please complete the following information):
Is this a problem that has occurred before?
I have a database of CIF files (i.e., multiple structures), so I have created a *.xyz database in format *.extxyz, which, read in the Python notebook, appears like a list:
[Atoms(symbols='CH3FI', pbc=False, forces=...),
Atoms(symbols='CH3FI', pbc=False, forces=...),
Atoms(symbols='I2', pbc=False, forces=...),
...]
I was wondering whether it is possible to use other target properties beyond energies/forces, for instance properties related to the entire structure and not to the atoms. And what if one does not have energies and forces?
Environment
BETA implementation of masks: https://github.com/mir-group/nequip/tree/masks/examples/mask_labels
See #240 for more discussion.
Describe the bug
After specifying that restart: True and append: True in the .yaml I get the following error
Torch device: cuda
Successfully loaded the data set of type NpzDataset(3973)...
Successfully built the network...
! Restarting training ...
Traceback (most recent call last):
File "trainandtest.py", line 176, in <module>
trainer.train()
File "/u/vdavi/.local/lib/python3.7/site-packages/nequip/train/trainer.py", line 673, in train
while not self.stop_cond:
File "/u/vdavi/.local/lib/python3.7/site-packages/nequip/train/trainer.py", line 765, in stop_cond
if self.iepoch >= self.max_epochs:
AttributeError: 'Trainer' object has no attribute 'iepoch'
Looking through your code it seems like the method from_dict ( which is part of the Trainer class) isn't called. This method is where the variable iepoch is called. I'm not sure how to fix this myself. What I'm looking at is /nequip/train/trainer.py, btw.
To Reproduce
I don't think this is too relevant here. All I've done is added restart: True and append: True to the .yaml file shown in the Developer's tutorial. Please let me know if you think I should put it here and I will edit this part.
Expected behavior
I want my training session to restart and append to the previous files. But in any case, an additional question I have is with your terminology. You have "restart" and "requeuing" options. What I want is to resume a training session after it was terminated due to time constraints. I.e. if it was stopped at epoch 20 but I wanted to run 30 epochs I want to resume the training. Is this part of restarting or requeuing?
Environment:
I'll edit this part soon but for now let me tell you I've run it on my personal computer and the clusters (different environments for sure) and have the same error so I think this isn't package dependent but related to trainer.py
Additional context
Please check my comments regarding resuming a training session. Thanks!
Describe the bug
After installing nequip using pip install ., running nequip-train configs/minimal.yaml results in an error: ModuleNotFoundError: No module named 'nequip.scripts'.
To Reproduce
pip install . after cd nequip. Then run nequip-train configs/minimal.yaml (or any other config file).
Expected behavior
Training is supposed to begin.
Environment (please complete the following information):
Additional context
Installing nequip with pip install -e . solves the issue.
For the O(3) representation, a parity variable is adopted, which represents reflections.
From what I understand, when we apply the action (x,y,z) |-> (-x,-y,-z) to all coordinates, the parity of every feature representation should change.
However, it seems nequip.nn.embedding.SphericalHarmonicEdgeAttrs always calculates edge attributes with fixed irreducible representations.
Could you explain more about it? Thanks for your time! :)
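One way to read this, assuming the edge attributes are spherical harmonics of the edge vector as the class name suggests: the irrep label records how a feature transforms and stays fixed, while the feature values change under inversion. Spherical harmonics of degree l satisfy Y_l(-r) = (-1)^l Y_l(r), which can be checked numerically with unnormalized real harmonics (the helper names below are illustrative only).

```python
from typing import List, Tuple

Vec = Tuple[float, float, float]


def harmonics_l1(r: Vec) -> List[float]:
    """Unnormalized real spherical harmonics of degree l = 1."""
    x, y, z = r
    return [x, y, z]


def harmonics_l2(r: Vec) -> List[float]:
    """Unnormalized real spherical harmonics of degree l = 2."""
    x, y, z = r
    return [x * y, y * z, 2 * z * z - x * x - y * y, x * z, x * x - y * y]


def invert(r: Vec) -> Vec:
    """The inversion (x, y, z) -> (-x, -y, -z)."""
    return (-r[0], -r[1], -r[2])


r = (0.3, -1.2, 0.8)
l1, l1_inv = harmonics_l1(r), harmonics_l1(invert(r))
l2, l2_inv = harmonics_l2(r), harmonics_l2(invert(r))

# l = 1 (odd parity): every component flips sign.
odd_flips = all(a == -b for a, b in zip(l1, l1_inv))
# l = 2 (even parity): every component is unchanged.
even_fixed = all(a == b for a, b in zip(l2, l2_inv))
```

So an attribute labeled with odd parity keeps that label after inversion; it is its value that picks up the sign.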
I've been using the stress branch for quite some time and I haven't experienced any issues with it so far. I was wondering whether it is possible to merge this functionality in the master branch?
I am able to install nequip, LAMMPS and torch separately.
When I try to patch and reinstall LAMMPS, I get the following error.
How can I solve it?
CMake Error in CMakeLists.txt:
Target "lammps" contains relative path in its
INTERFACE_INCLUDE_DIRECTORIES:
"MKL_INCLUDE_DIR-NOTFOUND"
CMake Error in CMakeLists.txt:
Target "lammps" contains relative path in its
INTERFACE_INCLUDE_DIRECTORIES:
"MKL_INCLUDE_DIR-NOTFOUND"
CMake Error in CMakeLists.txt:
Imported target "torch" includes non-existent path
"MKL_INCLUDE_DIR-NOTFOUND"
in its INTERFACE_INCLUDE_DIRECTORIES. Possible reasons include:
The path was deleted, renamed, or moved to another location.
An install or uninstall procedure did not complete successfully.
The installation package was faulty and references files it does not
provide.
CMake Error in CMakeLists.txt:
Imported target "torch" includes non-existent path
"MKL_INCLUDE_DIR-NOTFOUND"
in its INTERFACE_INCLUDE_DIRECTORIES. Possible reasons include:
The path was deleted, renamed, or moved to another location.
An install or uninstall procedure did not complete successfully.
The installation package was faulty and references files it does not
provide.
-- Generating done
CMake Generate step failed. Build files cannot be regenerated correctly.
Describe the bug
When I use nequip, installed with pip, on an A100 machine, the following error occurs:
NVIDIA A100-PCIE-40GB with CUDA capability sm_80 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the NVIDIA A100-PCIE-40GB GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
To Reproduce
Run nequip-train config/minimal.yaml on an A100 machine.
Expected behavior
Properly running config/minimal.yaml and config/example.yaml
Environment (please complete the following information):
Additional context
When I ran pip install --upgrade torch
to update torch to 2.0.0, the problem seemed to be solved and the example ran properly.
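The mismatch described in this report can be checked before training. The sketch below is a simplified illustration, assuming that a PyTorch build can run on a GPU when at least one of its compiled sm_XY targets shares the GPU's major compute capability; the helper name build_supports_gpu is hypothetical. torch.cuda.get_device_capability and torch.cuda.get_arch_list are the PyTorch calls used to read the real values, and they only run if PyTorch and a CUDA device are present.

```python
def build_supports_gpu(capability, arch_list):
    """Return True if any compiled sm_XY target shares the GPU's major version."""
    major, _minor = capability
    compiled = [int(a.split("_")[1]) for a in arch_list if a.startswith("sm_")]
    return any(sm // 10 == major for sm in compiled)

# Values taken from the error message above: an A100 reports sm_80.
old_build = ["sm_37", "sm_50", "sm_60", "sm_70"]
print(build_supports_gpu((8, 0), old_build))  # False

try:
    import torch
except ImportError:  # keep the sketch usable on machines without PyTorch
    torch = None

if torch is not None and torch.cuda.is_available():
    cap = torch.cuda.get_device_capability(0)
    archs = torch.cuda.get_arch_list()
    print(cap, archs, build_supports_gpu(cap, archs))
```

If the last line reports False on the target machine, installing a PyTorch build compiled for a newer CUDA architecture, as the reporter did, is the likely remedy.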
Describe the bug
The tutorial Colab example does not finish. The training continues indefinitely; I stopped it after 2000 epochs.
To Reproduce
Run the colab notebook referenced in https://github.com/mir-group/nequip#tutorial
Expected behavior
The training used to stop after 100 epochs; now it runs forever.
Describe the bug
NequIP calculator fails with device='cuda' but works with device='cpu'
To Reproduce
at = Atoms(...)
calc = nequip.from_deployed_model('model.pth', device='cuda')
at.calc = calc
at.get_forces()
Expected behavior
Calculator works with either CUDA or CPU device
Environment (please complete the following information):
The stack trace is:
/home/mphuthi/.conda/envs/nequip-stress/lib/python3.9/site-packages/nequip/scripts/deploy.py:109: UserWarning: Loaded model had a different value for allow_tf32 than was currently set; changing the GLOBAL setting!
warnings.warn(
/home/mphuthi/.conda/envs/nequip-stress/lib/python3.9/site-packages/nequip/scripts/deploy.py:120: UserWarning: Loaded model had a different value for _jit_bailout_depth than was currently set; changing the GLOBAL setting!
warnings.warn(
Traceback (most recent call last):
File "/home/mphuthi/Li/models/nequip/run1-6_v0_i0-1_low/nequip_test.py", line 12, in <module>
ph = nq.phonon_bands_and_dos(
File "/home/mphuthi/dev/calctest/calctest/calctest.py", line 812, in phonon_bands_and_dos
ph.run()
File "/home/mphuthi/.conda/envs/nequip-stress/lib/python3.9/site-packages/ase/phonons.py", line 201, in run
result = self.calculate(atoms_N, disp)
File "/home/mphuthi/.conda/envs/nequip-stress/lib/python3.9/site-packages/ase/phonons.py", line 320, in calculate
forces = self(atoms_N)
File "/home/mphuthi/.conda/envs/nequip-stress/lib/python3.9/site-packages/ase/phonons.py", line 317, in __call__
return atoms_N.get_forces()
File "/home/mphuthi/.conda/envs/nequip-stress/lib/python3.9/site-packages/ase/atoms.py", line 788, in get_forces
forces = self._calc.get_forces(self)
File "/home/mphuthi/.conda/envs/nequip-stress/lib/python3.9/site-packages/ase/calculators/abc.py", line 23, in get_forces
return self.get_property('forces', atoms)
File "/home/mphuthi/.conda/envs/nequip-stress/lib/python3.9/site-packages/ase/calculators/calculator.py", line 737, in get_property
self.calculate(atoms, [name], system_changes)
File "/home/mphuthi/.conda/envs/nequip-stress/lib/python3.9/site-packages/nequip/ase/nequip_calculator.py", line 115, in calculate
out = self.model(data)
File "/home/mphuthi/.conda/envs/nequip-stress/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: default_program(18): error: expected a ")"
default_program(21): error: expected a ";"
default_program(23): error: expression must have class type
default_program(24): error: expression must have class type
default_program(25): error: expression must have class type
default_program(26): error: expression must have class type
default_program(28): error: expression must have class type
7 errors detected in the compilation of "default_program".
nvrtc compilation failed:
#define NAN __int_as_float(0x7fffffff)
#define POS_INFINITY __int_as_float(0x7f800000)
#define NEG_INFINITY __int_as_float(0xff800000)
template<typename T>
__device__ T maximum(T a, T b) {
return isnan(a) ? a : (a > b ? a : b);
}
template<typename T>
__device__ T minimum(T a, T b) {
return isnan(a) ? a : (a < b ? a : b);
}
extern "C" __global__
void fused_mul_div_sin_div_mul_mul(float* t_, float* t__, float* aten_mul, float* aten_mul_1, float* aten_sin, float* aten_div, float* aten_mul_2, float* const_self.model.energy_model.radial_basis.basis.bessel_weights) {
{
if (512 * blockIdx.x + threadIdx.x<464000 ? 1 : 0) {
float const_self.model.energy_model.radial_basis.basis.bessel_weights_1 = __ldg(const_self.model.energy_model.radial_basis.basis.bessel_weights + (512 * blockIdx.x + threadIdx.x) % 8);
float t___1 = __ldg(t__ + (512 * blockIdx.x + threadIdx.x) / 8);
aten_mul_2[512 * blockIdx.x + threadIdx.x] = const_self.model.energy_model.radial_basis.basis.bessel_weights_1 * t___1;
aten_div[512 * blockIdx.x + threadIdx.x] = (float)((double)(const_self.model.energy_model.radial_basis.basis.bessel_weights_1 * t___1) / 6.0);
aten_sin[512 * blockIdx.x + threadIdx.x] = sinf((float)((double)(const_self.model.energy_model.radial_basis.basis.bessel_weights_1 * t___1) / 6.0));
aten_mul_1[512 * blockIdx.x + threadIdx.x] = (float)((double)((sinf((float)((double)(const_self.model.energy_model.radial_basis.basis.bessel_weights_1 * t___1) / 6.0))) / t___1) * 0.3333333333333333);
float v = __ldg(t_ + (512 * blockIdx.x + threadIdx.x) / 8);
aten_mul[512 * blockIdx.x + threadIdx.x] = (float)((double)((sinf((float)((double)(const_self.model.energy_model.radial_basis.basis.bessel_weights_1 * t___1) / 6.0))) / t___1) * 0.3333333333333333) * v;
}
}
}
Hello,
I am not sure if this is the right place to ask this question, but I am getting the following error when running LAMMPS with a NequIP-trained ML potential:
ERROR: Pair style NEQUIP requires newton pair off (src/pair_nequip.cpp:108)
What do you think could be the problem? I am using lammps-29Sep21 and nequip 0.5.5.
Thanks!
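The error text refers to the LAMMPS newton command, whose first flag controls pairwise interactions and which defaults to on. As a rough sketch, the hypothetical helper below scans an input script and reports whether the pairwise flag ends up off; the sample script is illustrative only and is not taken from the report.

```python
def newton_pair_is_off(script_text):
    """Return True if the last `newton` command turns the pairwise flag off.

    LAMMPS defaults to `newton on`, so a script without the command returns False.
    """
    pair_off = False
    for raw in script_text.splitlines():
        tokens = raw.split("#", 1)[0].split()  # drop trailing comments
        if len(tokens) >= 2 and tokens[0] == "newton":
            # `newton off` sets both flags; in `newton A B`, A is the pair flag.
            pair_off = tokens[1] == "off"
    return pair_off

# Illustrative input fragment (assumed, not from the original report).
sample = """
units metal
newton off          # set before the pair style is defined
pair_style nequip
"""
print(newton_pair_is_off(sample))
```

A script for which this returns False would be expected to trigger the error quoted above.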
Dear developers
I have a problem while training the MLP: I get large bumps in the MAE of the forces. I know that it is not unusual to get some bumps while training, but usually the error returns quite quickly to the value it had before the bump. That does not happen here: recovery takes a few hours, and in certain cases a new bump occurs before the old optimal error is reached. See for example the figure below.
I know that I could change, for example, the learning rate or batch size, or even restart the training from the optimal value before the bump. But I was wondering whether this is something you have seen while training MLPs, or whether you know what might cause it.
I ask because I am using almost the default settings of the "full.yaml" you provided (I only changed r_max to 6.0), so I would expect the settings to be quite good already. However, I got this strange behavior for two different systems (CsPbI3 and FAPbI3) and two different training set sizes (300 and 15000 structures). In the zip file you can find the full.yaml file and the training logs belonging to the figure above.
Kind regards
Tom
Describe the bug
The example config file includes this line:
metrics_key: validation_loss # metrics used for scheduling and saving best model. Options: `set`_`quantity`, set can be either "train" or "validation, "quantity" can be loss or anything that appears in the validation batch step header, such as f_mae, f_rmse, e_mae, e_rmse
Following those instructions, I set the value to train_loss. When I do, it fails with the exception
RuntimeError: metrics_key should start with either validation or training
Apparently it actually wants the value to be training_loss instead of train_loss. But when I change it to that, it fails with a different exception:
KeyError: 'training_loss'
It seems that some parts of the code expect one and other parts expect the other, so that neither works.
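To make the described mismatch concrete, here is a hypothetical sketch (not NequIP's actual implementation) of a single helper that splits a key of the form `set`_`quantity`. The prefixes are the two named in the RuntimeError above; the function name and the ValueError are assumptions made for illustration. If both the validity check and the later metric lookup went through one such function, the two code paths could not disagree about the prefix.

```python
VALID_SETS = ("training", "validation")  # prefixes named in the RuntimeError above

def split_metrics_key(metrics_key):
    """Split 'validation_f_mae' into ('validation', 'f_mae'); reject other prefixes."""
    for prefix in VALID_SETS:
        if metrics_key.startswith(prefix + "_"):
            return prefix, metrics_key[len(prefix) + 1:]
    raise ValueError(
        f"metrics_key should start with one of {VALID_SETS}, got {metrics_key!r}"
    )

print(split_metrics_key("validation_loss"))
print(split_metrics_key("training_f_mae"))
```

With this helper, train_loss is rejected up front with a message that lists the accepted prefixes.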
Environment (please complete the following information):
python --version: 3.9
import nequip; nequip.__version__: 0.5.4
import e3nn; e3nn.__version__: 0.4.4
import torch; torch.__version__: 1.10.0
nvcc --version:
import torch; torch.version.cuda:
Describe the bug
I am trying to reproduce the MRS2021 tutorial, but with the ASE interface. It worked in the Colab notebook, but I get a TorchScript error when I try to run it on our cluster here at DTU Physics. NequIP is installed into a venv from the developer branch with pip install -e nequip/. PyTorch is installed on the cluster using EasyBuild; it is version 1.9.0 built with the foss/2020b toolchain. The pip-installed version does not support our GPUs.
To Reproduce
The model was trained with the following script
rm -rf ./results
nequip-train ../nequip/configs/example.yaml
nequip-deploy build results/toluene/example-run-toluene toluene-deployed.pth
nequip-evaluate --train-dir results/toluene/example-run-toluene --batch-size 50
The attached script was then run; it produced the attached error message.
Expected behavior
No crash :-)
Environment (please complete the following information):
Additional context
slurm-4259922.log
runoptimize.py.txt