torchmd / torchmd-net

Training neural network potentials

License: MIT License

Python 70.47% Jupyter Notebook 19.54% Cuda 7.06% C++ 2.88% Makefile 0.04%
energy-functions equivariant-representations molecular-dynamics neural-networks transformer

torchmd-net's Introduction

TorchMD

About

TorchMD intends to provide a simple-to-use API for performing molecular dynamics using PyTorch. This enables researchers to iterate more rapidly on force-field development and to integrate neural network potentials seamlessly into the dynamics, with the simplicity and power of PyTorch.

TorchMD uses chemical units consistent with classical MD codes such as ACEMD, namely kcal/mol for energies, K for temperatures, g/mol for masses, and Å for distances.

TorchMD is currently a work in progress, so feel free to provide feedback on the API or report potential bugs in the GitHub issue tracker.

Also check out TorchMD-Net for fast and accurate neural network potentials: https://github.com/torchmd/torchmd-net/

Citation

Please cite:

@misc{doerr2020torchmd,
      title={TorchMD: A deep learning framework for molecular simulations},
      author={Stefan Doerr and Maciej Majewski and Adrià Pérez and Andreas Krämer and Cecilia Clementi and Frank Noe and Toni Giorgino and Gianni De Fabritiis},
      year={2020},
      eprint={2012.12106},
      archivePrefix={arXiv},
      primaryClass={physics.chem-ph}
}

To reproduce the paper, see the tutorial notebook: https://github.com/torchmd/torchmd-cg/blob/master/tutorial/Chignolin_Coarse-Grained_Tutorial.ipynb

License

Note: all the code in this repository is MIT-licensed; however, we use several file-format readers taken from Moleculekit, which has a free, open-source, not-for-profit research license. This applies mainly to torchmd/run.py. Moleculekit is installed automatically since it is listed in the requirements file. Check out Moleculekit here: https://github.com/Acellera/moleculekit

Installation

We recommend installing TorchMD in a new Python environment, ideally through the Miniforge package manager.

mamba create -n torchmd
mamba activate torchmd
mamba install pytorch python=3.10 -c conda-forge
mamba install moleculekit parmed jupyter -c acellera -c conda-forge # For running the examples
pip install torchmd

Examples

Various examples of how to perform dynamics using TorchMD can be found in the examples folder.

Help and comments

Please use the GitHub issue tracker of this repository.

Acknowledgements

We would like to acknowledge funding by the Chan Zuckerberg Initiative and Acellera in support of this project. This project will now be developed in collaboration with OpenMM (www.openmm.org) and ACEMD (www.acellera.com/acemd).

torchmd-net's People

Contributors

antoniomirarchi, basveeling, brian8128, giadefa, guillemsimeon, j-fabila, nec4, peastman, philippthoelke, raimis, raulppelaez, ruunyox, sebastianmdick, sef43, stefdoerr


torchmd-net's Issues

Inconsistency between gradients and forces

When training with forces, the model computes the negative gradient of the energy with respect to the positions (a.k.a. forces):

if self.derivative:
    grad_outputs: List[Optional[torch.Tensor]] = [torch.ones_like(out)]
    dy = grad(
        [out],
        [pos],
        grad_outputs=grad_outputs,
        create_graph=True,
        retain_graph=True,
    )[0]
    if dy is None:
        raise RuntimeError("Autograd returned None for the force prediction.")
    return out, -dy

Some of the loaders load forces:

all_dy = pt.tensor(
    mol["wb97x_dz.forces"][:] * self.HARTREE_TO_EV, dtype=pt.float32
)

all_dy = pt.tensor(
    mol["forces"][:] * self.HARTREE_TO_EV, dtype=pt.float32
)

While others load gradients (the opposite sign):

dy = (
    pt.tensor(mol["gradient_vector"][conf], dtype=pt.float32)
    * self.HARTREE_TO_EV
    / self.BORH_TO_ANGSTROM
)

all_dy = (
    pt.tensor(mol["dft_total_gradient"], dtype=pt.float32)
    * self.HARTREE_TO_EV
    / self.BORH_TO_ANGSTROM
)

We need to agree on what dy represents.
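For reference, a minimal sketch of one way to normalize this: keep dy as forces everywhere and flip the sign of gradient-valued data at load time. The helper below is hypothetical and only illustrates the convention; the conversion constants mirror the snippets above.

import torch as pt

# Hypothetical helper (not part of TorchMD-Net): convert a raw dataset array
# to a single convention where dy always stores forces, i.e. -dE/dpos.
def to_forces(raw, is_gradient, hartree_to_ev, bohr_to_angstrom=1.0):
    dy = pt.tensor(raw, dtype=pt.float32) * hartree_to_ev / bohr_to_angstrom
    # Gradient-valued data gets its sign flipped so it matches the model's
    # predicted forces (the negative energy gradient returned as -dy above).
    return -dy if is_gradient else dy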

Pre-trained model

We are writing a paper about NNP/MM in ACEMD. So far, we have used ANI-2x for protein-ligand simulations, but to demonstrate general utility it would be good to include one more NNP.

Would it be possible to have a pre-trained TorchMD-NET model?

Broken __repr__

Version: 0.1.2

import torchmdnet
m = torchmdnet.models.torchmd_gn.TorchMD_GN()
print(m)
  File "<stdin>", line 1, in <module>
  File "/shared/raimis/opt/miniconda/envs/kdeep_2/lib/python3.8/site-packages/torchmdnet/models/torchmd_gn.py", line 118, in __repr__
    return (f'{self.__class__.__name__}('
  File "/shared/raimis/opt/miniconda/envs/kdeep_2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 947, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'TorchMD_GN' object has no attribute 'derivative'

Molecules for benchmarking speed

As we have already started optimizing TorchMD-Net (#45), it would be good to agree on a set of molecules for benchmarks. Otherwise, different developers will start using different molecules and it will be hard to compare results.

Ideally, we need a set of molecules spanning sizes from 10 to 1000 atoms.

AttributeError: can't set attribute

When I try to train a model, it fails immediately with an exception:

Traceback (most recent call last):
  File "/home/peastman/workspace/torchmd-net/scripts/train.py", line 147, in <module>
    main()
  File "/home/peastman/workspace/torchmd-net/scripts/train.py", line 106, in main
    model = LNNP(args)
  File "/home/peastman/workspace/torchmd-net/torchmdnet/module.py", line 22, in __init__
    self.hparams = hparams
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/torch/nn/modules/module.py", line 995, in __setattr__
    object.__setattr__(self, name, value)
AttributeError: can't set attribute

This happens because LightningModule defines hparams as a @property with only a getter, not a setter.
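For reference, the usual fix in recent PyTorch Lightning versions is to store the arguments through save_hyperparameters() rather than assigning to self.hparams; a minimal sketch (not the actual LNNP code):

import pytorch_lightning as pl

class LNNP(pl.LightningModule):
    def __init__(self, hparams):
        super().__init__()
        # hparams is exposed as a read-only property in newer Lightning
        # releases, so populate it via save_hyperparameters() instead of
        # assigning self.hparams = hparams directly.
        self.save_hyperparameters(hparams)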

Access SPICE charges

Is there currently a way to access charges from the SPICE dataset class? As the code already supports passing charges to the model (through the q parameter), wouldn't it make sense to utilize that for SPICE? As I understand it, it doesn't make much sense to train models on SPICE without encoding the charge. @raimis @peastman

How to continue training with new settings?

After training a model, is there any way to do a new training run, starting from the result of the previous run? I know I can use the load_model option, but that resumes an existing run from a checkpoint with exactly the same settings used before. Here are the sorts of things I'd like to be able to do:

  • Train on one dataset, then fine tune on another dataset.
  • Initially train just on energies, then fine tune with forces as well.
  • Train in reduced precision (half or TF32), then fine tune in full precision.
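One possible workaround, sketched under assumptions about the checkpoint layout (a "state_dict" key with a "model." prefix, which may differ between versions): copy only the network weights from the previous run into a freshly configured model and start a new training run with the new settings.

import torch

def warm_start(new_model, ckpt_path):
    # Assumed checkpoint layout: Lightning-style "state_dict" whose keys are
    # prefixed with "model."; adjust to whatever the checkpoint actually contains.
    ckpt = torch.load(ckpt_path, map_location="cpu")
    state = {k.replace("model.", "", 1): v for k, v in ckpt["state_dict"].items()}
    # strict=False tolerates heads or buffers that changed between the two configs
    return new_model.load_state_dict(state, strict=False)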

Physics based priors

I want to implement some physics based priors. An example would be a Coulomb interaction based on pre-computed partial charges that are stored in the dataset. BasePrior.forward() is supposed to return a list of per-atom energy contributions, but physics based interactions usually do not decompose in that way. It would be much easier if it could just return a total energy for each sample.

What do you recommend as the cleanest way of implementing this?
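One workaround that keeps the existing per-atom interface: split each pair energy equally between the two atoms, so the per-atom contributions still sum to the physical total. A hedged sketch as a standalone function, not the actual BasePrior API; the neighbor handling, cutoff, units, and constant are assumptions.

import torch

def coulomb_per_atom(pos, q, batch, cutoff=10.0, k_e=332.0637):
    # k_e is Coulomb's constant in kcal/(mol*e^2) per Angstrom (assumed units).
    n = pos.shape[0]
    i, j = torch.triu_indices(n, n, offset=1)
    mask = batch[i] == batch[j]                 # only pairs within the same sample
    i, j = i[mask], j[mask]
    r = (pos[i] - pos[j]).norm(dim=-1)
    keep = r < cutoff
    i, j, r = i[keep], j[keep], r[keep]
    e_pair = k_e * q[i] * q[j] / r
    # Split each pair energy 50/50 so the per-atom contributions sum to the
    # physical total energy of every sample.
    e_atom = torch.zeros(n, dtype=pos.dtype, device=pos.device)
    e_atom.index_add_(0, i, 0.5 * e_pair)
    e_atom.index_add_(0, j, 0.5 * e_pair)
    return e_atom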

Creating a custom dataset

I want to train a model on a custom dataset. I'm trying to follow the example at https://github.com/torchmd/torchmd-cg/blob/master/tutorial/Chignolin_Coarse-Grained_Tutorial.ipynb, but my data is different enough that it isn't quite clear how I should format it.

My datasets consist of many molecules of different sizes. For each molecule I have

  • an array of atom type indices
  • an array of atom coordinates
  • a potential energy
  • (optional) an array of forces on atoms

This differs from the tutorial in a few critical ways. My molecules are all different sizes, so I can't just put everything into rectangular arrays. And the training data is different: sometimes I will have only energies, and sometimes I will have both forces and energies which should be trained on together. The example trains only on forces with no energies.

Any guidance would be appreciated!
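A minimal sketch of one way to structure such data as a torch_geometric dataset, with one Data object per molecule so differing sizes are no problem. The field names (z, pos, y, dy) follow what the training code is shown accessing elsewhere in this tracker, but treat them as assumptions and check against the dataset classes in torchmdnet.

import torch
from torch_geometric.data import Data, Dataset

class MyMolecules(Dataset):
    # Sketch: variable-size molecules, energies, and optional forces.
    def __init__(self, types, coords, energies, forces=None):
        super().__init__()
        self.types, self.coords = types, coords      # lists of per-molecule arrays
        self.energies, self.forces = energies, forces

    def len(self):
        return len(self.types)

    def get(self, idx):
        data = Data(
            z=torch.tensor(self.types[idx], dtype=torch.long),
            pos=torch.tensor(self.coords[idx], dtype=torch.float32),
            y=torch.tensor([[self.energies[idx]]], dtype=torch.float32),
        )
        if self.forces is not None:
            data.dy = torch.tensor(self.forces[idx], dtype=torch.float32)
        return data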

Broken links in README.md

Hi, I'm going through the README and some of the links and instructions (e.g. the usage) seem outdated or broken. Is it possible to fix them?

`CosineCutoff` has drastically different behavior when the lower cutoff is not zero

Recently when experimenting with SchNet model training with different cutoff settings, we observed an unexpected behavior:
The CosineCutoff in CFConv introduces a very different cutoff mask when we change the lower cutoff from zero to some small number (e.g., 1 Angstrom).
This can be understood as a design choice, but it seems unnecessary and sometimes even harmful, especially when the ExpNormal flavor of RBFs is used: the signal at short distances is significantly reduced for a nonzero lower cutoff, which is quite different from the zero-lower-cutoff case.
In fact, ExpNormalSmearing itself already applies the same cutoff function (with a constant lower cutoff of zero). In other words, the current implementation appears to apply the cutoff twice.
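For illustration only (not the library's exact CosineCutoff code), a shifted cosine window over [lower, upper]; with lower = 0 it reduces to the familiar 0.5 * (cos(pi * d / upper) + 1), while a nonzero lower bound reshapes the short-distance signal as described above:

import math
import torch

def cosine_cutoff(d, lower, upper):
    # Illustrative shifted cosine window on [lower, upper]; the actual
    # CosineCutoff implementation may use a different functional form.
    scaled = (d - lower) / (upper - lower)
    smooth = 0.5 * (torch.cos(math.pi * scaled) + 1.0)
    window = ((d > lower) & (d < upper)).to(d.dtype)
    return smooth * window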

Problems when running the QM9 example

Command:
CUDA_VISIBLE_DEVICES=0 python scripts/train.py --conf examples/ET-QM9.yaml --dataset QM9 --log-dir output/

Output:
File "/benchmark/science-computing/torchmd-net/torchmdnet/utils.py", line 153, in __call__
    raise ValueError(f"Unknown argument in config file: {key}")
ValueError: Unknown argument in config file: distributed_backend

Is there a problem with the example?

Optimization of the graph network

Optimization of the graph network (TorchMD_GN) with NNPOps (https://github.com/openmm/NNPOps).

In a special case, TorchMD_GN is equivalent to SchNet (#45 (comment)), which is already supported by NNPOps:

TorchMD_GN(rbf_type="gauss", trainable_rbf=False, activation="ssp", neighbor_embedding=False)
  • Implement PyTorch wrapper for CFConvNeighbors and CFConv -- openmm/NNPOps#40
  • Accelerate the limited TorchMD_GN with NNPOps -- #50
  • Update the installation instructions -- #55

In general, TorchMD_GN needs these:

TorchMD_GN(rbf_type="expnorm", trainable_rbf=True, activation="silu", neighbor_embedding=True)
  • Implement the exponentially-modified Gaussian in CFConv (rbf_type="expnorm")
  • Allow passing arbitrary RBF positions to CFConv (trainable_rbf=True)
  • Implement the SILU activation in CFConv (activation="silu")
  • Reuse CFConv to accelerate the neighbor embedding (neighbor_embedding=True)

The loss is NaN when the ET model's "distance_influence" is set to "none"

@raimis @PhilippThoelke

Hi, authors!

I recently trained on the QM9 dataset with your powerful ET model, and I found that when "distance_influence" is set to "keys", "values", or "both", the ET model works well. However, when I set "distance_influence" to "none", the loss becomes NaN in the back-propagation of the first batch.

Did you notice this phenomenon during your training? How can I fix it? I'm really curious to see how powerful your novel attention mechanism is.

Here are my training commands and results:

CUDA_VISIBLE_DEVICES=0 python scripts/train.py --conf examples/ET-QM9.yaml --dataset QM9 --log-dir ./edge-att-output/ --dataset-root ./data
CUDA_VISIBLE_DEVICES=0 python scripts/train.py --conf examples/ET-QM9.yaml --dataset QM9 --log-dir ./keys-edge-att-output/ --distance-influence keys --dataset-root=./data

They both work well.

CUDA_VISIBLE_DEVICES=0 python scripts/train.py --conf examples/ET-QM9.yaml --dataset QM9 --log-dir ./no-edge-att-output/ --distance-influence none --dataset-root=./data

When running the above command, I got a NaN loss (screenshot omitted).

All experiments are running on a single V100 GPU.

I would be very grateful for your help! Thanks!

Ability to roll back an epoch

I've been experimenting with training protocols to see what gives the best results. So far, the best I've found is to start with a large learning rate that pushes the boundaries of what's stable, then aggressively reduce it as needed. This seems to consistently give faster learning and a better final result than starting from a lower learning rate or reducing it more slowly.

Sometimes, though, the signal that your learning rate is too high can be quite dramatic. It isn't just that it fails to learn, but that the loss suddenly jumps way up. Here's an example from a recent training run. Notice how at epoch 30 the training loss and validation loss both increased by several times. It then took a few epochs to realize it needed to reduce the learning rate, and 10 epochs before the loss got back to where it had been before.

26,0.00039999998989515007,9832.5087890625,10538.8203125,5701.85009765625,4130.658203125,3393.247802734375,7145.572265625,30023
27,0.00039999998989515007,9179.70703125,12511.1748046875,5192.06591796875,3987.641357421875,5366.404296875,7144.77099609375,31135
28,0.00039999998989515007,8929.75390625,9831.01171875,5093.1494140625,3836.604736328125,2955.695556640625,6875.3173828125,32247
29,0.00039999998989515007,8372.0537109375,9940.771484375,4676.4296875,3695.62353515625,3155.14306640625,6785.62890625,33359
30,0.00039999998989515007,310591.90625,31877.890625,286285.375,24306.5078125,14621.484375,17256.404296875,34471
31,0.00039999998989515007,23531.33984375,27256.47265625,12166.06640625,11365.2734375,14695.8525390625,12560.6220703125,35583
32,0.00039999998989515007,18280.423828125,17700.439453125,9990.2021484375,8290.2216796875,7149.82421875,10550.6142578125,36695
33,0.00039999998989515007,14477.421875,14724.962890625,7723.6826171875,6753.73876953125,5242.77490234375,9482.1884765625,37807
34,0.00031999999191612005,12286.7001953125,13225.00390625,6404.654296875,5882.0458984375,4444.48876953125,8780.513671875,38919
35,0.00031999999191612005,11554.501953125,16997.564453125,6239.15966796875,5315.341796875,8691.9501953125,8305.61328125,40031
36,0.00031999999191612005,10664.4365234375,16877.546875,5792.0390625,4872.3974609375,9064.205078125,7813.341796875,41143
37,0.00031999999191612005,9997.931640625,11185.5439453125,5469.81396484375,4528.11767578125,3710.89599609375,7474.64794921875,42255
38,0.00025599999935366213,8595.7841796875,10532.455078125,4355.873046875,4239.9111328125,3280.87353515625,7251.58154296875,43367
39,0.00025599999935366213,8495.14453125,14786.5244140625,4462.298828125,4032.845458984375,7734.66015625,7051.86474609375,44479
40,0.00025599999935366213,7973.798828125,10847.546875,4123.08740234375,3850.7109375,3955.357177734375,6892.18994140625,45591

What do you think about adding an option to detect this instability by checking for the training loss increasing by more than a specified amount? When that happened, it would undo the epoch, rolling the model parameters back to the end of the previous epoch, and immediately reduce the learning rate.
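A rough sketch of that option, with hypothetical names: snapshot the model and optimizer at the end of every epoch, and if the training loss jumps by more than a chosen factor, restore the snapshot and cut the learning rate immediately.

import copy

class RollbackGuard:
    # Hypothetical helper illustrating the rollback-on-spike idea above.
    def __init__(self, model, optimizer, max_increase=3.0, lr_factor=0.8):
        self.model, self.optimizer = model, optimizer
        self.max_increase, self.lr_factor = max_increase, lr_factor
        self.best_loss, self.snapshot = None, None

    def end_of_epoch(self, train_loss):
        if self.best_loss is not None and train_loss > self.max_increase * self.best_loss:
            # Undo the epoch: restore the last snapshot and reduce the learning rate.
            self.model.load_state_dict(self.snapshot["model"])
            self.optimizer.load_state_dict(self.snapshot["optimizer"])
            for group in self.optimizer.param_groups:
                group["lr"] *= self.lr_factor
            return True  # epoch rolled back
        self.best_loss = train_loss if self.best_loss is None else min(self.best_loss, train_loss)
        self.snapshot = {
            "model": copy.deepcopy(self.model.state_dict()),
            "optimizer": copy.deepcopy(self.optimizer.state_dict()),
        }
        return False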

Reproducing SoTA results for QM9

Hello Author,

I noticed that your powerful ET model achieves SoTA results on 5 tasks in QM9, and I would like to reproduce the results.

Unfortunately, when I adopted the QM9 template from your repo, turned Atomref and standardization on/off, and changed the output modules accordingly, the reproduced results (after early stopping) are still much worse than those reported in the paper.

For example, the dipole moment MAE is 0.01 vs. 0.0002, and the electronic spatial extent MAE is 0.035 vs. 0.015.

I'm curious whether you have adjusted the hyper-parameters for each property. Can you share the hyper-parameters that can reproduce the results with me?

Thanks for your help!

Versioning

It would be good to have code versioning for reproducibility.

Periodic boundary conditions

I want to try applying a model to a larger system that uses periodic boundary conditions. Currently that isn't supported. It looks to me like the only class that would need to be modified is Distance? But it implements the calculation with torch_cluster.radius_graph(), which doesn't support periodic boundary conditions. So either we would need to add that feature to torch_cluster, or else we would need to rewrite that class to work differently.

Any suggestions on the best way to approach this?
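For the distance part at least, the minimum-image convention for an orthorhombic box is straightforward; a hedged sketch is below (building the neighbor list itself under PBC would still need support in torch_cluster or a replacement for radius_graph).

import torch

def minimum_image_deltas(pos, edge_index, box):
    # pos: (N, 3) coordinates, box: (3,) orthorhombic box lengths.
    # This only illustrates the wrapped distance computation, not the
    # neighbor search that Distance currently delegates to radius_graph.
    vec = pos[edge_index[0]] - pos[edge_index[1]]
    vec = vec - box * torch.round(vec / box)   # wrap each component into [-box/2, box/2]
    return vec, vec.norm(dim=-1)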

radius graph defaults

Hello everyone! The tools are great so far. I noticed that there is a small but important corner case in the usage of radius_graph from torch_cluster. When searching for neighbors, the behavior is governed by the cutoff distance r and max_num_neighbors (see docs here: https://github.com/rusty1s/pytorch_cluster#radius-graph). The latter defaults to a maximum of 32 neighbors for each node. If, for example, the user inputs a large cutoff distance intending to return all neighbors, the result is still truncated at 32 neighbors even if the user expects more. Furthermore, I'm not sure how radius_graph decides which extra neighbors to reject, or how example shuffling during training affects this; for my use case it seems to make a big difference in training and inference. Because the SchNet philosophy is to operate on the notion of cutoff distances, not maximum neighbors, would it make sense to add a kwarg to TorchMD_GN.__init__() to raise the limit on the maximum number of neighbors for this operation?

Of course, most users probably will not run into this problem if they stick to small cutoffs because they will never hit the upper neighbor ceiling. However, I would be happy to branch, implement this, write tests, and make a PR if it seems like a good idea.
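The proposed kwarg would essentially just forward a user-chosen limit to radius_graph instead of relying on its default of 32; a minimal sketch (the value 256 is only an example, not a recommendation):

from torch_cluster import radius_graph

def build_neighbors(pos, batch, cutoff_upper, max_num_neighbors=256):
    # Forward the raised neighbor ceiling to the neighbor search instead of
    # silently truncating at torch_cluster's default of 32.
    return radius_graph(pos, r=cutoff_upper, batch=batch, loop=False,
                        max_num_neighbors=max_num_neighbors)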

Add W&B

I have talked to Albert about helping with this.

`test_hdf5_multiprocessing` fails randomly

The test_hdf5_multiprocessing test is failing randomly:

Error message:

__________________________ test_hdf5_multiprocessing ___________________________

tmpdir = local('/tmp/pytest-of-runner/pytest-0/test_hdf5_multiprocessing0')
num_entries = 100

    def test_hdf5_multiprocessing(tmpdir, num_entries=100):
        # generate sample data
        z = np.zeros(num_entries)
        pos = np.zeros(num_entries * 3).reshape(-1, 3)
        energy = np.zeros(num_entries)
    
        # write the dataset
        data = h5py.File(join(tmpdir, "test_hdf5_multiprocessing.h5"), mode="w")
        group = data.create_group("group")
        group["types"] = z[:, None]
        group["pos"] = pos[:, None]
        group["energy"] = energy
        data.flush()
        data.close()
    
        # load the dataset using the HDF5 class and multiprocessing
        dset = HDF5(join(tmpdir, "test_hdf5_multiprocessing.h5"))
        with mp.Pool(2) as p:
            result = p.map(get_hdf5_file_id, [dset, dset])
>       assert result[0] != result[1], "Both processes received the same h5py File instance"
E       AssertionError: Both processes received the same h5py File instance
E       assert 140491698696144 != 140491698696144

tests/test_datasets.py:75: AssertionError

The test was introduced by #84, while the CI was broken.

Unable to reproduce the results of ET on QM9

I am trying to reproduce the results in the paper using equivariant transformers on QM9 to predict the energy. Here is my config, the hyperparameters are the same as those mentioned in appendix A in "EQUIVARIANT TRANSFORMERS FOR NEURAL NETWORK-BASED MOLECULAR POTENTIALS".

activation: silu
aggr: add
atom_filter: -1
attn_activation: silu
batch_size: 128
coord_files: null
cutoff_lower: 0.0
cutoff_upper: 5.0
dataset: QM9
dataset_arg: energy_U0
dataset_root: ~/data
derivative: false
distance_influence: both
distributed_backend: ddp_cpu
early_stopping_patience: 300
ema_alpha_dy: 1.0
ema_alpha_y: 1.0
embed_files: null
embedding_dimension: 256
energy_files: null
energy_weight: 1.0
force_files: null
force_weight: 1.0
inference_batch_size: 128
load_model: null
log_dir: /tmp/logs
lr: 0.0004
lr_factor: 0.8
lr_min: 1.0e-06
lr_patience: 15
lr_warmup_steps: 10000
max_num_neighbors: 32
max_z: 100
neighbor_embedding: true
ngpus: 1
num_epochs: 300
num_heads: 8
num_layers: 8
num_nodes: 1
num_rbf: 64
num_workers: 4
output_model: Scalar
precision: 32
prior_model: Atomref
rbf_type: expnorm
redirect: false
reduce_op: add
save_interval: 3
seed: 3
splits: null
standardize: false
test_interval: 3
test_size: null
train_size: 110000
trainable_rbf: false
val_size: 10000
weight_decay: 0.0

However, I was unable to reproduce the results in the paper.

  1. The model I constructed only has 2.5 M parameters, while according to appendix A, the model in the paper with the same hyperparameters has 6.87M parameters.
  2. The best test error I got was 12meV, which is significantly higher than 6.24meV reported in the paper.

Bias terms in the Equivariant Block and Attention Mechanism Implementation

Dear Philipp

I've been looking at the code for this implementation in order to reproduce some of the results, and I noticed that the vector linear projections used in the Gated Equivariant Blocks and the Attention blocks don't have their bias terms removed. This seems to deviate from the descriptions from your paper and the PaiNN papers.

self.vec1_proj = nn.Linear(hidden_channels, hidden_channels)
self.vec2_proj = nn.Linear(hidden_channels, out_channels)

https://github.com/torchmd/torchmd-net/blob/main/torchmdnet/models/torchmd_et.py#L224

Unfortunately these bias terms would prevent the model implementation from being rotationally invariant, since you will be adding a vector in the (1,1,1) direction to all the vector features whenever it is called.

In my own implementation, the gated equivariant blocks seemed to have a propensity to produce NaNs during training if this bias term is removed, since the projection is applied immediately before an L2 norm. I'm still investigating this, however.

All the best,

Charlie

TorchScript support

TorchMD_GN cannot be converted to TorchScript:

import torch
from torchmdnet.models.torchmd_gn import TorchMD_GN

model = TorchMD_GN()
torch.jit.script(model)
Traceback (most recent call last):
  File "tmn_jit.py", line 5, in <module>
    torch.jit.script(model)
  File "/shared/raimis/opt/miniconda/envs/tmn/lib/python3.8/site-packages/torch/jit/_script.py", line 942, in script
    return torch.jit._recursive.create_script_module(
  File "/shared/raimis/opt/miniconda/envs/tmn/lib/python3.8/site-packages/torch/jit/_recursive.py", line 391, in create_script_module
    return create_script_module_impl(nn_module, concrete_type, stubs_fn)
  File "/shared/raimis/opt/miniconda/envs/tmn/lib/python3.8/site-packages/torch/jit/_recursive.py", line 448, in create_script_module_impl
    script_module = torch.jit.RecursiveScriptModule._construct(cpp_module, init_fn)
  File "/shared/raimis/opt/miniconda/envs/tmn/lib/python3.8/site-packages/torch/jit/_script.py", line 391, in _construct
    init_fn(script_module)
  File "/shared/raimis/opt/miniconda/envs/tmn/lib/python3.8/site-packages/torch/jit/_recursive.py", line 428, in init_fn
    scripted = create_script_module_impl(orig_value, sub_concrete_type, stubs_fn)
  File "/shared/raimis/opt/miniconda/envs/tmn/lib/python3.8/site-packages/torch/jit/_recursive.py", line 452, in create_script_module_impl
    create_methods_and_properties_from_stubs(concrete_type, method_stubs, property_stubs)
  File "/shared/raimis/opt/miniconda/envs/tmn/lib/python3.8/site-packages/torch/jit/_recursive.py", line 335, in create_methods_and_properties_from_stubs
    concrete_type._create_methods_and_properties(property_defs, property_rcbs, method_defs, method_rcbs, method_defaults)
  File "/shared/raimis/opt/miniconda/envs/tmn/lib/python3.8/site-packages/torch/jit/_recursive.py", line 757, in try_compile_fn
    return torch.jit.script(fn, _rcb=rcb)
  File "/shared/raimis/opt/miniconda/envs/tmn/lib/python3.8/site-packages/torch/jit/_script.py", line 989, in script
    fn = torch._C._jit_script_compile(
RuntimeError: 
python value of type 'module' cannot be used as a value. Perhaps it is a closed over global variable? If so, please consider passing it in as an argument or use a local varible instead.:
  File "/shared/raimis/opt/miniconda/envs/tmn/lib/python3.8/site-packages/torch_geometric/nn/pool/__init__.py", line 210
        edge_index = radius_graph(x, r=1.5, batch=batch, loop=False)
    """
    if torch_cluster is None:
       ~~~~~~~~~~~~~ <--- HERE
        raise ImportError('`radius_graph` requires `torch-cluster`.')
'radius_graph' is being compiled since it was called from 'Distance.forward'
  File "/shared/raimis/opt/miniconda/envs/tmn/lib/python3.8/site-packages/torchmdnet/models/utils.py", line 179
    def forward(self, pos, batch):
        edge_index = radius_graph(pos, r=self.cutoff_upper, batch=batch, loop=self.loop,
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                  max_num_neighbors=self.max_num_neighbors)
                                  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
        edge_vec = pos[edge_index[0]] - pos[edge_index[1]]

PyTorch Geometric support TorchScript (https://pytorch-geometric.readthedocs.io/en/latest/notes/jit.html), but a few modifications are needed.

`aggr` attribute is not properly being set from `create_model`

I think that the aggr attribute (for CFConv output behavior) is not properly being set for TorchMD_GN type representations in create_model:

from https://github.com/compsciencelab/torchmd-net/blob/815db87e361f90852b6131eefebc94ef2ddda53d/torchmdnet/models/model.py#L37

    if args["model"] == "graph-network":
        is_equivariant = False
        representation_model = TorchMD_GN(
            num_filters=args["embedding_dimension"], **shared_args
        )

I think there should be a statement after num_filters that sets aggr, otherwise it always defaults to add regardless of what is in shared_args. I can create a PR to fix this shortly if needed.

NaN when fitting with derivative

I'm trying to fit an equivariant transformer model. If I specify derivative: true in the configuration file to use derivatives in fitting, then after only a few training steps the model output becomes NaN. This happens even if I also specify force_weight: 0.0. The derivatives shouldn't affect the loss at all in that case, yet fitting still fails. The obvious explanation would be a NaN somewhere in the training data, since that would make the loss NaN even after multiplying by 0, but I verified that's not the case. Immediately after it computes the loss

loss_dy = loss_fn(deriv, batch.dy)

I added

print(loss_dy, torch.all(torch.isfinite(deriv)), torch.all(torch.isfinite(batch.dy)))

Here's the relevant output from the log.

Epoch 0:   1%|          | 31/5483 [00:06<19:36,  4.63it/s, loss=1.28e+07, v_num=_]tensor(11670.3730, device='cuda:0', grad_fn=<MseLossBackward0>) tensor(True, device='cuda:0') tensor(True, device='cuda:0')
Epoch 0:   1%|          | 32/5483 [00:06<19:32,  4.65it/s, loss=1.25e+07, v_num=_]tensor(273794.6562, device='cuda:0', grad_fn=<MseLossBackward0>) tensor(True, device='cuda:0') tensor(True, device='cuda:0')
Epoch 0:   1%|          | 33/5483 [00:07<19:28,  4.67it/s, loss=1.25e+07, v_num=_]tensor(nan, device='cuda:0', grad_fn=<MseLossBackward0>) tensor(False, device='cuda:0') tensor(True, device='cuda:0')
Epoch 0:   1%|          | 34/5483 [00:07<19:25,  4.68it/s, loss=nan, v_num=_]     tensor(nan, device='cuda:0', grad_fn=<MseLossBackward0>) tensor(False, device='cuda:0') tensor(True, device='cuda:0')

batch.dy never contains a non-finite value.

Any idea what could be causing this?
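One standard way to narrow this down (a debugging aid, not a fix) is PyTorch's anomaly detection, which raises on the first backward operation that produces a NaN, plus an explicit finiteness check on intermediate tensors. Both are plain PyTorch, not TorchMD-Net-specific, and the helper name is hypothetical.

import torch

# Raise on the first backward op that produces NaN so the offending layer
# appears in the stack trace (slows training, so enable only while debugging).
torch.autograd.set_detect_anomaly(True)

def assert_finite(name, tensor):
    # Hypothetical helper to sprinkle through the forward pass while debugging.
    if not torch.isfinite(tensor).all():
        raise RuntimeError(f"non-finite values detected in {name}")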

Cannot create env with mamba

I have been trying to use torchmd-net and torchmd-cg. I first tried installing torchmd-net following the given instructions. When I tried to create the env with mamba, I encountered the following error. Help?

Encountered problems while solving:
  - nothing provides requested nnpops 0.2
  - nothing provides requested pytorch_cluster 1.5.9

How to create an environment?

I've been stuck for hours trying to figure out how to create an environment in which I can run this code. I start by creating a new environment:

$ conda create --name torchmd python=3.8
$ conda activate torchmd

I then try to install using the command in the readme, but it fails:

Processing /home/peastman/workspace/torchmd-net
Collecting torch
  Downloading torch-1.8.1-cp38-cp38-manylinux1_x86_64.whl (804.1 MB)
     |████████████████████████████████| 804.1 MB 11 kB/s 
Collecting pytorch-lightning
  Downloading pytorch_lightning-1.2.8-py3-none-any.whl (841 kB)
     |████████████████████████████████| 841 kB 1.9 MB/s 
Collecting torch-geometric
  Downloading torch_geometric-1.7.0.tar.gz (212 kB)
     |████████████████████████████████| 212 kB 1.8 MB/s 
Collecting torch-sparse
  Downloading torch_sparse-0.6.9.tar.gz (36 kB)
    ERROR: Command errored out with exit status 1:
     command: /home/peastman/miniconda3/envs/torchmd/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-0n_m0hsd/torch-sparse_8404983884744cbdbf7decfcfdfeb0f4/setup.py'"'"'; __file__='"'"'/tmp/pip-install-0n_m0hsd/torch-sparse_8404983884744cbdbf7decfcfdfeb0f4/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-pip-egg-info-sww8yhre
         cwd: /tmp/pip-install-0n_m0hsd/torch-sparse_8404983884744cbdbf7decfcfdfeb0f4/
    Complete output (5 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-0n_m0hsd/torch-sparse_8404983884744cbdbf7decfcfdfeb0f4/setup.py", line 8, in <module>
        import torch
    ModuleNotFoundError: No module named 'torch'

As an alternative I tried using conda to install the dependencies:

conda install -c conda-forge pytorch_geometric

But that produces a broken environment. import torch_geometric fails with

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.8/site-packages/torch_geometric/__init__.py", line 2, in <module>
    import torch_geometric.nn
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.8/site-packages/torch_geometric/nn/__init__.py", line 2, in <module>
    from .data_parallel import DataParallel
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.8/site-packages/torch_geometric/nn/data_parallel.py", line 5, in <module>
    from torch_geometric.data import Batch
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.8/site-packages/torch_geometric/data/__init__.py", line 1, in <module>
    from .data import Data
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.8/site-packages/torch_geometric/data/data.py", line 7, in <module>
    from torch_sparse import coalesce, SparseTensor
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.8/site-packages/torch_sparse/__init__.py", line 12, in <module>
    torch.ops.load_library(importlib.machinery.PathFinder().find_spec(
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.8/site-packages/torch/_ops.py", line 105, in load_library
    ctypes.CDLL(path)
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.8/ctypes/__init__.py", line 381, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /home/peastman/miniconda3/envs/torchmd/lib/python3.8/site-packages/torch_sparse/_version.so: undefined symbol: _ZN3c1017RegisterOperatorsC1Ev

This is on Ubuntu 16.04.

bad_alloc using PyTorch 1.12

I have a model created with TorchMD-Net. I want to use it for running a simulation in OpenMM. That involves compiling to TorchScript, saving to a file, and loading it with the PyTorch C++ API. When I try to do that, it crashes with a bad_alloc down inside libtorch.

Is this expected to work? Or do some of the packages like pyg and torch-cluster not support that workflow? If it's known not to work right now, what would need to happen to make it work?
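For reference, the Python side of that workflow is roughly the following (the checkpoint path is a placeholder; the C++ side would call torch::jit::load on the saved file):

import torch
from torchmdnet.models.model import load_model

# Placeholder checkpoint path; load_model and torch.jit.script appear elsewhere
# in this tracker, so only the save step is new here.
model = load_model("model.ckpt")
torch.jit.script(model).save("model_torchscript.pt")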

Test loss not improving in most recent PyTorch Lightning version

There is currently a problem when using the most recent PyTorch Lightning version where the test loss does not improve at all, and the validation and training loss exhibit spikes during the epochs in which we evaluate on the test set (plot omitted).
The spikes in validation and training loss are always exactly the same value, and during these epochs the learning rate gets increased back to the initial learning rate. PyTorch Lightning likely changed something in the way trainer.run_test() works, which we are using as a workaround to allow evaluating the test set during training.

The test set is currently evaluated with:
https://github.com/compsciencelab/torchmd-net2/blob/755336ce23f5342157e753a0e04d744db09269e2/src/module.py#L142-L144

The problem does not occur in PyTorch Lightning version 1.1

Ways to reduce memory use

I'm trying to train equivariant transformer models on a GPU with 12 GB of memory. I can train small to medium sized models, but if I make it too large (for example, 6 layers with embedding dimension 96), CUDA runs out of device memory. Is there anything I can do to reduce the memory requirements? I already tried reducing the batch size but it didn't help.

Crashes while training

I just updated to the most recent code and I can no longer train models. It gets about 7% of the way through the first epoch and then crashes. Here's the full error message from the log.

  File "/home/peastman/workspace/torchmd-net/scripts/train.py", line 172, in <module>
    main()
  File "/home/peastman/workspace/torchmd-net/scripts/train.py", line 165, in main
    trainer.fit(model, data)
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 458, in fit
    self._run(model)
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 756, in _run
    self.dispatch()
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 797, in dispatch
    self.accelerator.start_training(self)
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 96, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 144, in start_training
    self._results = trainer.run_stage()
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 807, in run_stage
    return self.run_train()
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 869, in run_train
    self.train_loop.run_training_epoch()
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 490, in run_training_epoch
    batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 731, in run_training_batch
    self.optimizer_step(optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 432, in optimizer_step
    using_lbfgs=is_lbfgs,
  File "/home/peastman/workspace/torchmd-net/torchmdnet/module.py", line 111, in optimizer_step
    super().optimizer_step(*args, **kwargs)
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/core/lightning.py", line 1403, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/core/optimizer.py", line 214, in step
    self.__optimizer_step(*args, closure=closure, profiler_name=profiler_name, **kwargs)
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/core/optimizer.py", line 134, in __optimizer_step
    trainer.accelerator.optimizer_step(optimizer, self._optimizer_idx, lambda_closure=closure, **kwargs)
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 329, in optimizer_step
    self.run_optimizer_step(optimizer, opt_idx, lambda_closure, **kwargs)
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 336, in run_optimizer_step
    self.training_type_plugin.optimizer_step(optimizer, lambda_closure=lambda_closure, **kwargs)
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 193, in optimizer_step
    optimizer.step(closure=lambda_closure, **kwargs)
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/torch/optim/optimizer.py", line 89, in wrapper
    return func(*args, **kwargs)
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/torch/optim/adamw.py", line 65, in step
    loss = closure()
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 726, in train_step_and_backward_closure
    split_batch, batch_idx, opt_idx, optimizer, self.trainer.hiddens
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 814, in training_step_and_backward
    result = self.training_step(split_batch, batch_idx, opt_idx, hiddens)
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 280, in training_step
    training_step_output = self.trainer.accelerator.training_step(args)
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 204, in training_step
    return self.training_type_plugin.training_step(*args)
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 319, in training_step
    return self.model(*args, **kwargs)
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 705, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/overrides/base.py", line 46, in forward
    output = self.module.training_step(*inputs, **kwargs)
  File "/home/peastman/workspace/torchmd-net/torchmdnet/module.py", line 44, in training_step
    return self.step(batch, mse_loss, 'train')
  File "/home/peastman/workspace/torchmd-net/torchmdnet/module.py", line 60, in step
    pred, deriv = self(batch.z, batch.pos, batch.batch)
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/peastman/workspace/torchmd-net/torchmdnet/module.py", line 41, in forward
    return self.model(z, pos, batch=batch)
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/peastman/workspace/torchmd-net/torchmdnet/models/output_modules.py", line 60, in forward
    x, z, pos, batch = self.representation_model(z, pos, batch=batch)
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/peastman/workspace/torchmd-net/torchmdnet/models/torchmd_gn.py", line 112, in forward
    edge_index, edge_weight = self.distance(pos, batch)
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/peastman/workspace/torchmd-net/torchmdnet/models/utils.py", line 182, in forward
    max_num_neighbors=self.max_num_neighbors)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/torch_cluster/radius.py", line 53, in radius_graph
    if batch_x is not None:
        assert x.size(0) == batch_x.numel()
        batch_size = int(batch_x.max()) + 1
                     ~~~ <--- HERE

        deg = x.new_zeros(batch_size, dtype=torch.long)
RuntimeError: CUDA error: device-side assert triggered

ANI1-equivariant_transformer cannot be converted to TorchScript

import torch
from torchmdnet.models.model import load_model

# Works
model = load_model('examples/pretrained/ANI1-transformer/epoch=109-val_loss=0.0008-test_loss=0.0180.ckpt')
torch.jit.script(model)

# Fails
model = load_model('examples/pretrained/ANI1-equivariant_transformer/epoch=209-val_loss=0.0003-test_loss=0.0093.ckpt')
torch.jit.script(model)
Traceback (most recent call last):
  File "debug.py", line 10, in <module>
    torch.jit.script(model)
  File "/home/user/conda/lib/python3.7/site-packages/torch/jit/_script.py", line 943, in script
    obj, torch.jit._recursive.infer_methods_to_compile
  File "/home/user/conda/lib/python3.7/site-packages/torch/jit/_recursive.py", line 391, in create_script_module
    return create_script_module_impl(nn_module, concrete_type, stubs_fn)
  File "/home/user/conda/lib/python3.7/site-packages/torch/jit/_recursive.py", line 448, in create_script_module_impl
    script_module = torch.jit.RecursiveScriptModule._construct(cpp_module, init_fn)
  File "/home/user/conda/lib/python3.7/site-packages/torch/jit/_script.py", line 391, in _construct
    init_fn(script_module)
  File "/home/user/conda/lib/python3.7/site-packages/torch/jit/_recursive.py", line 428, in init_fn
    scripted = create_script_module_impl(orig_value, sub_concrete_type, stubs_fn)
  File "/home/user/conda/lib/python3.7/site-packages/torch/jit/_recursive.py", line 448, in create_script_module_impl
    script_module = torch.jit.RecursiveScriptModule._construct(cpp_module, init_fn)
  File "/home/user/conda/lib/python3.7/site-packages/torch/jit/_script.py", line 391, in _construct
    init_fn(script_module)
  File "/home/user/conda/lib/python3.7/site-packages/torch/jit/_recursive.py", line 428, in init_fn
    scripted = create_script_module_impl(orig_value, sub_concrete_type, stubs_fn)
  File "/home/user/conda/lib/python3.7/site-packages/torch/jit/_recursive.py", line 448, in create_script_module_impl
    script_module = torch.jit.RecursiveScriptModule._construct(cpp_module, init_fn)
  File "/home/user/conda/lib/python3.7/site-packages/torch/jit/_script.py", line 391, in _construct
    init_fn(script_module)
  File "/home/user/conda/lib/python3.7/site-packages/torch/jit/_recursive.py", line 428, in init_fn
    scripted = create_script_module_impl(orig_value, sub_concrete_type, stubs_fn)
  File "/home/user/conda/lib/python3.7/site-packages/torch/jit/_recursive.py", line 452, in create_script_module_impl
    create_methods_and_properties_from_stubs(concrete_type, method_stubs, property_stubs)
  File "/home/user/conda/lib/python3.7/site-packages/torch/jit/_recursive.py", line 335, in create_methods_and_properties_from_stubs
    concrete_type._create_methods_and_properties(property_defs, property_rcbs, method_defs, method_rcbs, method_defaults)
RuntimeError: 
Return value was annotated as having type Tensor but is actually of type Tuple[Tensor, Tensor]:
  File "/tmp/user_pyg_jit/tmpvk33vwft.py", line 225
        out = self.message(r_ij=kwargs.r_ij, k_j=kwargs.k_j, d_ij=kwargs.d_ij, vec_j=kwargs.vec_j, dk=kwargs.dk, dv=kwargs.dv, q_i=kwargs.q_i, v_j=kwargs.v_j)
        out = self.aggregate(out, index=kwargs.index, ptr=kwargs.ptr, dim_size=kwargs.dim_size)
        return self.update(out)
        ~~~~~~~~~~~~~~~~~~~~~~ <--- HERE

Unable to fit model

I've been trying to train a model on an early subset of the SPICE dataset. All my efforts so far have been unsuccessful. I must be doing something wrong, but I really don't know what. I'm hoping someone else can spot the problem. My configuration file is given below. Here's the HDF5 file for the dataset.

I've tried training with or without derivatives. I've tried a range of initial learning rates, with or without warmup. I've tried varying model parameters. I've tried restricting it to only molecules with no formal charges. Nothing makes any difference. In all cases, the loss starts out at about 2e7 and never decreases.

The dataset consists of all SPICE calculations that had been completed when I started working on this a couple of weeks ago. I converted the units so positions are in Angstroms and energies in kJ/mol. I also subtracted off per-atom energies. Atom types are the union of element and formal charge. Here's the mapping:

typeDict = {('Br', -1): 0, ('Br', 0): 1, ('C', -1): 2, ('C', 0): 3, ('C', 1): 4, ('Ca', 2): 5, ('Cl', -1): 6,
            ('Cl', 0): 7, ('F', -1): 8, ('F', 0): 9, ('H', 0): 10, ('I', -1): 11, ('I', 0): 12, ('K', 1): 13,
            ('Li', 1): 14, ('Mg', 2): 15, ('N', -1): 16, ('N', 0): 17, ('N', 1): 18, ('Na', 1): 19, ('O', -1): 20,
            ('O', 0): 21, ('O', 1): 22, ('P', 0): 23, ('P', 1): 24, ('S', -1): 25, ('S', 0): 26, ('S', 1): 27}

If anyone can provide insight, I'll be very grateful!

activation: silu
atom_filter: -1
batch_size: 128
cutoff_lower: 0.0
cutoff_upper: 8.0
dataset: HDF5
dataset_root: SPICE-corrected.hdf5
derivative: false
distributed_backend: ddp
early_stopping_patience: 40
embedding_dimension: 64
energy_weight: 1.0
force_weight: 0.001
inference_batch_size: 128
lr: 1.e-4
lr_factor: 0.8
lr_min: 1.e-7
lr_patience: 10
lr_warmup_steps: 5000
max_num_neighbors: 80
max_z: 28
model: equivariant-transformer
neighbor_embedding: true
ngpus: -1
num_epochs: 1000
num_heads: 8
num_layers: 5
num_nodes: 1
num_rbf: 64
num_workers: 4
rbf_type: expnorm
save_interval: 5
seed: 1
test_interval: 10
test_size: 0.01
trainable_rbf: true
val_size: 0.05
weight_decay: 0.0

Loading jittable models

Hello. After catching up to main, I am no longer able to load my models after training them. When calling torch.load() on a .pt file, I get the following error:

ModuleNotFoundError: No module named 'CFConvJittable_07e26a'

Is there a new procedure for loading models for prediction/simulation?

Reproducing QM9 Results

I am training ET on QM9, trying to reproduce the results reported in the paper.
I used the parameters in ET-QM9.yaml without tuning; the resulting performance for HOMO and R2 is shown in the attached screenshots (omitted here).
Is there any recommendation for parameter selection?

Clarifications of the method

Hello, after reading the paper, I had several questions regarding your approach. Thanks a lot in advance for taking the time to answer them.

Your embedding layer is more complex than usual: your initial node representation already seems to depend on its neighbour’s representation.

  • Is this beneficial? Have you done experiments to show it?

Graph construction: you use a smooth cutoff function and describe some benefits. You describe a Transformer but still use a cutoff value.

  • Is that statement correct? Why? So we do not capture long-range dependencies, right? Is the smooth cutoff beneficial, i.e. have you seen something empirically that either motivates it or shows its benefits?

You say the feature vectors are passed through a normalization layer.

  • Can you explain? Including some motivation, maybe.

An intermediate node embedding (y_i) utilising attention scores is created and impacts the final x_i and v_i embeddings. This step weights a projection of each neighbor's representation, ~ $a_{ij} (W \cdot RBF(d_{ij}) \cdot \vec{V}_j)$, by the attention score.

  • You use interatomic distances twice, don't you? Is weighting only by attention not enough theoretically?

The equivariant message m_ij (component of sum to obtain w_i) is obtained by multiplying s_ij^2 (i.e. v_j scaled by RBF(d_ij)) by the directional info r_ij; then adding to it s_ij^1 (i.e. v_j scaled by RBF(d_ij)) re-multiplied by v_j.

  • Do you think that multiplying the message sequentially by distance info and directional info is the best choice to embed both types of information? Why not concatenate the r_ij (r_i - r_j) and d_ij (the norm of r_ij, i.e. the distance) info and have a single operation, for instance?

  • Is multiplying s_ij^1 by v_j (again) necessary? (First within s_ij, then by multiplying s_ij element-wise with v_j.)

  • IMPORTANT. r_ij has dimension 3 while s_ij^2 has dimension F. In Eq. (11), how can you apply an element-wise multiplication? Is it a typo? How exactly do you combine these two quantities? What's your take on the best way to combine 3D info (a directional vector) with an existing embedding? This is a genuine question I am interested in, if you have references or insights on this bit…

The invariant representation involves the scalar product of the projected equivariant vectors, (U1 v_i) and (U2 v_i).

  • What is the real benefit/aim of this scalar product? Is a single projection not enough?

Additional aggregation options for message passing/cfconv

Hello all - I think it may be beneficial to expand beyond the default option of 'add' for the aggregation scheme used in the message passing/CFConv. A good example is when you want the filter response to be independent of the molecule size (e.g., using 'mean' instead of 'add'). See https://pytorch-geometric.readthedocs.io/en/latest/notes/create_gnn.html for more details. The original SchNet paper also divided the filter output by the molecule size. I will implement this and create a PR if you all think it's a good idea to include.
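For context, in torch_geometric the reduction is selected through the aggr argument of MessagePassing; a minimal illustration (not the actual CFConv code) of how 'mean' would make the filter response independent of molecule size:

from torch_geometric.nn import MessagePassing

class CFConvLike(MessagePassing):
    # Illustration only: aggr="mean" averages the messages per node instead
    # of summing them, as discussed above.
    def __init__(self, aggr="mean"):
        super().__init__(aggr=aggr)

    def forward(self, x, edge_index):
        # propagate() gathers x_j for every edge and reduces them with self.aggr
        return self.propagate(edge_index, x=x)

    def message(self, x_j):
        return x_j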

Distance scaling in ExpNorm RBFs

I have noticed that the ExpNorm basis functions might benefit from a new parameter that rescales the distances inside the inner exponential. Consider the default setup where the distances are not rescaled and an upper cutoff of 35 angstroms is used (the first figure). While there is a good density of basis functions at small distances, there are only a few at longer distances. By contrast, if the distances are rescaled by some factor, say alpha=0.2 (the second figure), the basis functions are distributed much more evenly (without changing anything else). I think this might be at odds with how the distance expansions are implemented in the TorchMD_GN class, because it always expects a fixed number of parameters common to all distance expansion types. If this is desirable, I would be happy to make a PR and implement tests.

(Figures omitted: density of ExpNorm basis functions with the default setup vs. with distances rescaled by alpha=0.2.)
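A hedged sketch of the proposed rescaling, written as a standalone function; the exact parameterization in ExpNormalSmearing may differ, and alpha here is the new factor being suggested:

import torch

def expnorm_rbf(d, means, betas, alpha=1.0):
    # phi_k(d) = exp(-beta_k * (exp(-alpha * d) - mu_k)^2)
    # With alpha < 1 the inner exponential decays more slowly, spreading the
    # basis functions more evenly over large cutoffs, as described above.
    inner = torch.exp(-alpha * d.unsqueeze(-1))       # shape (N, 1)
    return torch.exp(-betas * (inner - means) ** 2)   # shape (N, num_rbf)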

Reproducing results for QM9

Hi! I have recently been trying to reproduce your results for QM9 and have some questions. I forked your code this week and started a run on QM9 for predicting the HOMO energy. I only changed dataset-arg and log-dir, otherwise everything is identical to the latest version of your code I think. The command I used is:

CUDA_VISIBLE_DEVICES=0,1 python scripts/train.py --conf examples/ET-QM9.yaml --dataset QM9 --log-dir output/ --dataset-arg homo

I matched the GPU type and count to your paper: I launched over two RTX 2080 Ti's.

However, I only seem to achieve 26 meV for the HOMO test MAE. Do you have any suggestions for what might be differing in my setup compared to yours? Do I need to change any of the other configs in examples/ET-QM9.yaml to train on HOMO? How much variance should I expect across seeds?

Thank you for your help! 🙂

Sheh

Test loss

Which function is supposed to compute the test loss? Can we clean it up?

def validation_step(self, batch, batch_idx, *args):
    if len(args) == 0 or (len(args) > 0 and args[0] == 0):
        # validation step
        return self.step(batch, mse_loss, "val")
    # test step
    return self.step(batch, l1_loss, "test")

def test_step(self, batch, batch_idx):
    return self.step(batch, l1_loss, "test")

Ping: @PhilippThoelke @stefdoerr

how to get the test result?

I have finished the training step and saved the checkpoints. However, the TEST RESULTS output is an empty dict (screenshot omitted).

An error occurred in setting up the environment

I set up the environment according to the installation manual. Every time I create a virtual environment, cudatoolkit 11.7.0 is downloaded automatically. However, my Linux CUDA version is 11.6, and running torchmd-train then reports an error (screenshots omitted). Can you help? Thank you very much!

Problem creating an environment

I installed torchmd-net as shown in README.md. When I run the examples with the command

CUDA_VISIBLE_DEVICES=0 python scripts/train.py --conf examples/ET-QM9.yaml

it reports errors (screenshot omitted).

Does anyone know the reason?

Open dataset in setup and/instead of __init__

Currently the QM9 dataset is opened and assigned to the lightning module inside __init__: https://github.com/compsciencelab/torchmd-net2/blob/119d41d94db27074b8299e531cf5788bea94f311/src/module.py#L22-L22

Usually in PyTorch Lightning this is done inside setup to ensure that each worker opens the dataset separately. However, the model should be created inside __init__, and in the case of QM9 it requires setting the atomref, which comes from the QM9 dataset, already at construction. As a workaround we could open the dataset once inside __init__ and discard it, just to reopen it inside setup. Although better than the current version, I don't think this is an optimal solution to the problem.
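For reference, the Lightning pattern being discussed looks roughly like this (a sketch, not the actual module; class and attribute names beyond the Lightning API are assumptions):

import pytorch_lightning as pl
from torch_geometric.datasets import QM9

class QM9DataModule(pl.LightningDataModule):
    def __init__(self, root):
        super().__init__()
        self.root = root
        self.dataset = None  # opened lazily in setup()

    def setup(self, stage=None):
        # Called on every process after __init__, so each worker gets its own
        # handle; the atomref needed at model construction time remains the
        # open question described above.
        self.dataset = QM9(self.root)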

Optimize equivariant transformer

I've started looking into how to improve the speed of the equivariant transformer. I'm opening this issue as a place to report my progress.

I'm mainly interested in the speed of running simulations with it: predicting energy and forces for a single molecule. I'm less concerned about training speed. As a starting point, I tried loading molecules of various sizes and computing forces. For the small molecules we're mainly interested in, the time is completely dominated by overhead and barely affected by the number of atoms. It ranged from about 11.5 ms for molecules with around 20 atoms, up to 13.1 ms for one with 94 atoms.

Viewing it with the PyTorch profiler shows the GPU running lots of tiny kernels with long gaps in between. The GPU is sitting idle most of the time. In contrast, if I profile it during training with a batch size of 64, the GPU usage is very high. There might still be ways to make it faster, but at least operating on batches keeps the GPU from being idle.

There are two possible approaches to making it faster: improve the PyTorch code, or just replace it with custom CUDA kernels. The latter is likely to be more effective, so that's what I'm inclined to do. The custom kernels would be added to NNPOps.
