benzoin96485 / enerzyme


Next generation machine learning force field on enzymatic catalysis

Python 99.47% Jupyter Notebook 0.53%

enerzyme's Introduction

Enerzyme

Towards a next-generation machine learning force field for enzymatic catalysis.

Current model architectures:

PhysNet: J. Chem. Theory Comput. 2019, 15, 3678−3693
SpookyNet: Nat. Commun. 2021, 12(1), 7273

Usage

Installation

Recommended environment

python==3.10.12
pip==23.2.1
setuptools==68.1.2
h5py==3.9.0
numpy==1.24.3
addict==2.4.0
tqdm==4.66.1
joblib==1.3.2
pandas==2.1.0
pytorch==2.0.1
scikit-learn==1.3.0
ase==3.22.1
transformers==4.33.1
torch-ema==0.3
pyyaml==6.0.1

pip install -e .

Training

Energy (force) / Atomic Charge / Dipole moment fitting.

enerzyme train -c <configuration yaml file> -o <output directory>

Please see enerzyme/config/train.yaml for details and recommended configurations.

Enerzyme saves the preprocessed dataset, split indices, final <configuration yaml file>, and the best model to the <output directory>.

Evaluation

Energy (force) / Atomic Charge / Dipole moment prediction.

enerzyme predict -c <configuration yaml file> -o <output directory> -m <model directory>

Please see enerzyme/config/predict.yaml for details.

Enerzyme reads the model configuration from the <model directory>, loads the models, runs predictions with all active models, saves the predicted values as a pickle in the corresponding model subfolders, and reports the results as a csv file in the <output directory>.

Simulation

Supported simulation types:

Flexible scan along the distance between two atoms
Constrained Langevin MD

enerzyme simulate -c <configuration yaml file> -o <output directory> -m <model directory>

Enerzyme reads the model configuration from the <model directory>, loads the models, runs the simulation, and reports the results in the <output directory>.

enerzyme's Issues

Optimize the storage of the full neighbor list

The full neighbor list scales as O($N^2$) with the system size and takes up a lot of disk space when the preprocessed dataset is stored. In fact, only one full neighbor list is needed if the dataset contains a single system in different configurations. The storage of atom types and total charges can be optimized as well.
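The idea above can be sketched as follows. This is a minimal illustration, not Enerzyme's implementation: it assumes that "full" means every ordered pair of distinct atoms with no cutoff, so the list depends only on the atom count, and the function names `full_neighbor_list` and `build_shared_neighbor_lists` are hypothetical.

```python
from itertools import permutations


def full_neighbor_list(n_atoms):
    """Return (idx_i, idx_j) listing every ordered pair with i != j."""
    pairs = list(permutations(range(n_atoms), 2))
    idx_i = [i for i, _ in pairs]
    idx_j = [j for _, j in pairs]
    return idx_i, idx_j


def build_shared_neighbor_lists(atom_counts):
    """Build one neighbor list per distinct system size, not one per frame."""
    cache = {}
    for n in atom_counts:
        if n not in cache:
            cache[n] = full_neighbor_list(n)
    return cache


# 1000 frames of the same 3-atom system need only a single stored list.
shared = build_shared_neighbor_lists([3] * 1000)
idx_i, idx_j = shared[3]
print(len(shared), len(idx_i))  # 1 6
```

Under these assumptions, the stored size grows with the number of distinct systems rather than with the number of frames.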

General-purpose dispersion layer: DFT-D4

The DFT-D4 energy calculation should be separated from SpookyNet. A general layer with positions as input and atomic dispersion energies as output is better for extensibility.

Support and standardize the format of datasets

Since npz and hdf5 are better suited to storing huge datasets, enerzyme should support and standardize reading data from them. The pickled format should be further standardized, too.

Feature calculation, preloading and storage

For fast computation and rigorous reproducibility across different models, the data split and the processed features, including scaled and translated energies, atomic numbers, neighbor lists, batch segmentations, ... should be stored for reuse.
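One part of this, reusing a stored data split, can be sketched as below. This is only an illustration under assumptions: the JSON layout, the file name `split.json`, and the helper names `make_split` and `load_or_create_split` are hypothetical and do not describe the format Enerzyme actually writes.

```python
import json
import random
import tempfile
from pathlib import Path


def make_split(n_samples, train_fraction, seed):
    """Shuffle sample indices with a fixed seed and cut them into two sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = int(n_samples * train_fraction)
    return {"train": indices[:n_train], "valid": indices[n_train:]}


def load_or_create_split(path, n_samples, train_fraction=0.8, seed=0):
    """Reuse a stored split if one exists; otherwise create and store it."""
    path = Path(path)
    if path.exists():
        return json.loads(path.read_text())
    split = make_split(n_samples, train_fraction, seed)
    path.write_text(json.dumps(split))
    return split


with tempfile.TemporaryDirectory() as tmp:
    split_file = Path(tmp) / "split.json"  # hypothetical file name
    first = load_or_create_split(split_file, n_samples=10)
    # The stored split wins, so a different seed does not change the result.
    second = load_or_create_split(split_file, n_samples=10, seed=123)
```

Because the second call reads the stored file, every model trained against the same output directory sees identical training and validation indices.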

General-purpose Coulomb layer

The Coulomb energy calculation should be separated from specific models. A general layer with atomic charges and positions as inputs and atomic Coulomb energies as output is better for extensibility.
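The interface described above (charges and positions in, per-atom energies out) can be sketched in plain Python. This is a hedged illustration, not the layer used in PhysNet or SpookyNet: it uses the bare, undamped Coulomb law, splits each pair energy equally between the two atoms, and assumes units in which the Coulomb constant `k_e` is 1. A real layer would operate on PyTorch tensors and likely add short-range damping. The function name `atomic_coulomb_energy` is hypothetical.

```python
import math


def atomic_coulomb_energy(charges, positions, k_e=1.0):
    """Per-atom Coulomb energy; each pair energy is shared equally by its atoms."""
    n = len(charges)
    energies = [0.0] * n
    for i in range(n):
        for j in range(i + 1, n):
            r_ij = math.dist(positions[i], positions[j])
            e_pair = k_e * charges[i] * charges[j] / r_ij
            energies[i] += 0.5 * e_pair
            energies[j] += 0.5 * e_pair
    return energies


charges = [1.0, -1.0]
positions = [(0.0, 0.0, 0.0), (0.0, 0.0, 2.0)]
e_atomic = atomic_coulomb_energy(charges, positions)
print(e_atomic, sum(e_atomic))  # [-0.25, -0.25] -0.5
```

Since the function depends only on charges and positions, any model that predicts atomic charges could feed it without knowing how the energy is computed.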

Package distribution through setup.py

Write a setup.py so that the package can be installed locally. An enerzyme command should be registered to invoke main.py for training, evaluation, or simulation.
The immediate motivation for this issue is the need to separate the models currently in use from the models under development.

General-purpose dispersion layer: DFT-D3

The DFT-D3 energy calculation should be separated from PhysNet. A general layer with positions as input and atomic dispersion energies as output is better for extensibility.

Support EMA training scheme

Exponential moving average (EMA) is a training scheme widely used in machine learning force fields, including PhysNet, SpookyNet, and NequIP. It is believed to improve convergence, so its effect is worth studying. The implementation in the NequIP repo can serve as a reference.
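The update rule behind EMA can be sketched in a few lines. This is a minimal illustration with scalar parameters, not the NequIP or Enerzyme implementation, and the class name `ParameterEMA` is hypothetical. A tensor-based version is available in the `torch-ema` package listed in the recommended environment.

```python
class ParameterEMA:
    """Keeps an exponential moving average (shadow copy) of named parameters."""

    def __init__(self, params, decay=0.999):
        self.decay = decay
        self.shadow = dict(params)  # start from the current parameter values

    def update(self, params):
        """shadow <- decay * shadow + (1 - decay) * current value."""
        d = self.decay
        for name, value in params.items():
            self.shadow[name] = d * self.shadow[name] + (1.0 - d) * value


params = {"w": 0.0}
ema = ParameterEMA(params, decay=0.5)
for step_value in (1.0, 1.0, 1.0):
    params["w"] = step_value  # stand-in for an optimizer step
    ema.update(params)
print(ema.shadow["w"])  # 0.875
```

In a typical setup the shadow values are used for validation and for the saved model, while the raw parameters continue to be optimized.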

SpookyNet refactor

Build SpookyNet with model builders, as is done for PhysNet, and share the common modules and layers between the two.

Add total energy scaling and shifting transformation

This type of data normalization is not well defined when priors such as ZBL repulsion, electrostatic energy, and dispersion correction are introduced. However, a comparison should still be made, especially against vanilla NequIP results, to make sure that the observed advantage does not come from data normalization.
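For reference, the transformation in question can be sketched as follows. This is one common choice and an assumption on my part: shifting by the training-set mean and scaling by the training-set standard deviation of the total energies. The function names are hypothetical, and the sketch does not address how the transformation interacts with the physical priors mentioned above.

```python
from statistics import fmean, pstdev


def fit_scale_shift(energies):
    """Estimate the shift (mean) and scale (standard deviation) from training data."""
    shift = fmean(energies)
    scale = pstdev(energies)
    return scale, shift


def normalize(energies, scale, shift):
    """Map raw energies to zero-mean, unit-variance training targets."""
    return [(e - shift) / scale for e in energies]


def denormalize(values, scale, shift):
    """Map model outputs back to the original energy scale."""
    return [v * scale + shift for v in values]


train_energies = [-10.0, -12.0, -14.0]
scale, shift = fit_scale_shift(train_energies)
targets = normalize(train_energies, scale, shift)
restored = denormalize(targets, scale, shift)
```

The statistics are fitted on the training set only and then reused unchanged for validation and prediction, so that the inverse transformation recovers energies in the original units.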

Add energy decomposition monitoring

Since electrostatic, ZBL repulsion, and dispersion correction layers are used, it is important to monitor how their contributions evolve during training. The first step is to provide a monitoring option that reports the average of each energy term over the training and validation sets every epoch.
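A minimal sketch of such a monitor is shown below. It assumes each layer can report its per-structure energy contribution for a batch. The class name `EnergyTermMonitor` and the term names are hypothetical and are not part of Enerzyme.

```python
from collections import defaultdict


class EnergyTermMonitor:
    """Accumulates per-term energy sums over an epoch and reports their means."""

    def __init__(self):
        self.reset()

    def reset(self):
        """Clear the accumulators at the start of each epoch."""
        self.sums = defaultdict(float)
        self.count = 0

    def update(self, batch_terms):
        """batch_terms maps a term name to a list of per-structure energies."""
        sizes = {len(values) for values in batch_terms.values()}
        if len(sizes) != 1:
            raise ValueError("all terms must cover the same structures")
        for name, values in batch_terms.items():
            self.sums[name] += sum(values)
        self.count += sizes.pop()

    def report(self):
        """Return the mean of each energy term over all structures seen so far."""
        return {name: total / self.count for name, total in self.sums.items()}


monitor = EnergyTermMonitor()
monitor.update({"coulomb": [-1.0, -3.0], "zbl": [0.2, 0.4], "dispersion": [-0.1, -0.3]})
monitor.update({"coulomb": [-2.0], "zbl": [0.3], "dispersion": [-0.2]})
epoch_means = monitor.report()
```

Keeping one monitor for the training set and one for the validation set, and calling `reset()` at the start of each epoch, would give the per-epoch curves described above.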