optimal-pse-lab / deepdock

Code related to : O. Mendez-Lucio, M. Ahmad, E.A. del Rio-Chanona, J.K. Wegner, A Geometric Deep Learning Approach to Predict Binding Conformations of Bioactive Molecules

License: MIT License

Python 18.33% Jupyter Notebook 80.68% Shell 0.12% Dockerfile 0.86%

deepdock's Introduction

DeepDock

Open access preprint available here

Use v1.0.0 to reproduce results reported in the paper

DeepDock_optimization_2RKA.mp4

Table of Contents

About The Project
Getting Started
Usage
License
Contact
Acknowledgements

About The Project

This method is based on geometric deep learning and is capable of predicting the binding conformations of ligands to protein targets. Concretely, the model learns a statistical potential based on distance likelihood which is tailor-made for each ligand-target pair. This potential can be coupled with global optimization algorithms to reproduce experimental binding conformations of ligands.

We showed that:

Geometric deep learning can learn a potential based on distance likelihood for ligand-target interactions
This potential performs similarly to or better than well-established scoring functions for docking and screening tasks
It can be coupled with global optimization algorithms to reproduce experimental binding conformations of ligands
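To make the idea of a potential based on distance likelihood concrete, here is a minimal sketch in plain Python. It assumes the likelihood of each ligand-target distance is modelled as a mixture of Gaussians and that the potential is the negative log-likelihood summed over distances. The mixture parameters and the distances are hypothetical values chosen for illustration, and a single mixture is shared by all distances for brevity; this is not the repository's implementation.

```python
import math

def gaussian_pdf(d, mu, sigma):
    """Density of a normal distribution N(mu, sigma^2) evaluated at distance d."""
    z = (d - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def mixture_likelihood(d, pis, mus, sigmas):
    """Likelihood of distance d under a Gaussian mixture with weights pis."""
    return sum(p * gaussian_pdf(d, m, s) for p, m, s in zip(pis, mus, sigmas))

def potential(distances, pis, mus, sigmas):
    """Negative log-likelihood summed over distances (lower means more likely)."""
    return -sum(math.log(mixture_likelihood(d, pis, mus, sigmas)) for d in distances)

# Hypothetical mixture: two components centred at 3.5 and 6.0 (arbitrary units).
pis, mus, sigmas = [0.7, 0.3], [3.5, 6.0], [0.5, 1.0]

near = potential([3.4, 3.6, 5.9], pis, mus, sigmas)   # distances close to the modes
far = potential([9.0, 10.0, 12.0], pis, mus, sigmas)  # distances far from the modes
```

Under these assumptions, a pose whose distances sit near the modes of the mixture gets a lower potential than one whose distances are far from them, which is the quantity a global optimizer would then minimize.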

Getting Started

Prerequisites

This package runs on PyTorch and PyTorch Geometric. On top of these, it uses standard packages such as pandas and NumPy. For the complete list, see the requirements.txt file.

Install the packages listed in requirements.txt
```
pip install -r requirements.txt
```

Install RDKit

conda install -c conda-forge rdkit=2019.09.1

Installation

Using Docker image

Install docker
Pull docker image from DockerHub
```
docker pull omendezlucio/deepdock
```

Launch the container.

docker run -it omendezlucio/deepdock:latest

Using Dockerfile

To build an image and run it from scratch:

Install docker

Clone repo and move into project folder.

git clone https://github.com/OptiMaL-PSE-Lab/DeepDock.git
cd DeepDock

Build the Docker image. This takes 20-30 minutes.
```
docker build -t deepdock:latest .
```

Launch the container.

docker run -it --rm --name deepdock-env deepdock:latest

From source

Clone the repo

git clone https://github.com/OptiMaL-PSE-Lab/DeepDock.git

Move into the project folder and update submodules

cd DeepDock
git submodule update --init --recursive

Install prerequisite packages

conda install -c conda-forge rdkit=2019.09.1
pip install -r requirements.txt

Install the DeepDock package
```
pip install -e .
```

Data

You can get the training and testing data by following these steps.

Move into the project data folder
```
cd DeepDock/data
```
Use the following line to download the preprocessed data used to train and test the model. This will download two files, one containing PDBbind (2.3 GB) used for training and another containing CASF-2016 (32 MB) used for testing. These two files are enough to run all examples.
```
source get_deepdock_data.sh
```
In case you want to reproduce all results of the paper you will need to download the complete CASF-2016 set (~1.5 GB). You can do so with this command line from the data folder.
```
source get_CASF_2016.sh
```

Usage

Usage examples can be seen directly in the Jupyter notebooks included in the repo.

License

Distributed under the MIT License. See LICENSE for more information.

Contact

@omendezl and @AntonioE89

Project Link: DeepDock

Acknowledgements

deepdock's Issues

Unable to load data for training example

RuntimeError: The 'data' object was created by an older version of PyG. If this error occurred while loading an already existing dataset, remove the 'processed/' directory in the dataset's root folder and try again.

This error is thrown when I try to run the second cell in the Train_DeepDock.ipynb example.
I have tried using the Archive manager to open the dataset.tar but it somehow fails to extract, may I know what is contained within the tar files? Is it a regular tar archive file or some other file renamed with a .tar extension?

I am currently using torch-geometric 2.0.3, which is a later version than the one in requirements.txt. Unfortunately my RTX 30 series hardware only supports CUDA 11+ and CUDA 10 won't work, so I am unable to use older versions of torch/torch-geometric.

May I know if there are any other ways I could get my hands on the training data?

Thank you.
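One way to check whether a downloaded file is really a tar archive, or some other container that merely carries a .tar extension, is to ask Python's standard library. The sketch below uses only `tarfile.is_tarfile` and `zipfile.is_zipfile`; the helper name `describe_archive` is hypothetical, and the sketch makes no claim about what the DeepDock data files actually contain.

```python
import tarfile
import zipfile

def describe_archive(path):
    """Return a rough guess at the container format of the file at `path`."""
    if tarfile.is_tarfile(path):
        return "tar"
    if zipfile.is_zipfile(path):
        return "zip"
    # Neither format was recognised: the file may be a serialized object,
    # a truncated download, or something else entirely.
    return "unknown"
```

If this returns "unknown" for a dataset file, an archive manager will not be able to extract it, and the file is presumably meant to be read by the loading code in the notebooks rather than unpacked by hand.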

GPU utilization is very low

First, I changed the input data from type Data to type HeteroData, which contains target and ligand, and replaced DataLoader with DataListLoader in order to train the model on multiple GPUs. However, GPU utilization is extremely low throughout training: for example, 15%-20% per GPU when I use 4 GPUs. I also trained the model on a single GPU, where utilization is just 30%, only a little higher than with multiple GPUs.
For multi-GPU, batch_size is 12; for single GPU, batch_size is 3.

dataset download issue

Hi,
I found that the training set and test set data downloaded by data/get_deepdock_data.sh are damaged. Can you update the data set again?
Thanks.
David.

About screening power calculation

Your article "https://www.nature.com/articles/s42256-021-00409-9" is great, thank you very much for sharing the source code and test results. But I have some questions about the results of the CASF2016 Screening power test.
In your ForwardScreeningPower_Deepdocks_3A.out:
The best ligand is found among top 1% candidates for 25 cluster(s); success rate = 43.9%
The best ligand is found among top 5% candidates for 35 cluster(s); success rate = 61.4%
The best ligand is found among top 10% candidates for 47 cluster(s); success rate = 82.5%

Why do the results have a different number of clusters instead of a uniform 57 clusters?
Looking forward to your reply, thanks!
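One reading that is consistent with the numbers quoted above: the cluster count on each line is the number of clusters for which the best ligand was found within that cutoff, and the success rate is that count divided by the 57 clusters in total. This is an interpretation of the output, not a statement from the authors; the arithmetic can be checked directly.

```python
total_clusters = 57

# Cluster counts as printed in the quoted output, keyed by cutoff.
hits = {"top 1%": 25, "top 5%": 35, "top 10%": 47}

# Success rate in percent, rounded to one decimal as in the quoted output.
rates = {cutoff: round(100 * n / total_clusters, 1) for cutoff, n in hits.items()}
# rates == {'top 1%': 43.9, 'top 5%': 61.4, 'top 10%': 82.5}
```

All three percentages match the reported ones when 57 is used as the denominator.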

Constant additive term in MDN

DeepDock/deepdock/models.py

Line 218 in ab1e450

sigma = F.elu(self.z_sigma(C))+1.1

Hi,
first of all thanks for sharing the code.

Just a simple question: what is the meaning of the +1.1 and +1 in the output for sigma and mu in the mixture density network? Is it some kind of prior knowledge you incorporate in the model? Or simply some numeric regularization?

Thanks in advance for your answer.
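One property that follows from the definition of ELU alone: with alpha = 1, ELU is bounded below by -1, so `elu(x) + 1.1` is always greater than 0.1 and `elu(x) + 1` is always greater than 0. The sketch below reimplements ELU in plain Python to show this. The helper names are hypothetical, and whether this is the authors' motivation for the offsets is an assumption, not something stated in the code.

```python
import math

def elu(x, alpha=1.0):
    """Exponential linear unit: x for x > 0, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def sigma_from_raw(raw):
    """Mirror of the quoted line: a standard deviation that stays above 0.1."""
    return elu(raw) + 1.1

def mu_from_raw(raw):
    """Mirror of the +1 offset mentioned in the question: a non-negative mean."""
    return elu(raw) + 1.0

raw_values = [-50.0, -5.0, -1.0, 0.0, 2.0]
sigmas = [sigma_from_raw(r) for r in raw_values]
mus = [mu_from_raw(r) for r in raw_values]
```

Read this way, the offsets keep sigma away from zero, which avoids dividing by a vanishing standard deviation in the Gaussian density, and keep mu positive, as expected for a distance.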

libcusparse issue

I got an error about libcusparse when I ran Train DeepDock.
Could you please help me to solve this issue?

Dataload error while running training_DeepDock.ipynb

Hello, I am trying to run training_DeepDock.ipynb.
With my environment (pytorch=1.10.2 and pyg=2.0.3), I get the following error while loading the preprocessed data:

Traceback (most recent call last):
  File "/db2/users/kyuhyunlee/DeepDock_test/train_deepdock.py", line 21, in <module>
    db_complex = PDBbind_complex_dataset(data_path=deepdock.__path__[0]+'/../data/dataset_deepdock_pdbbind_v2019_16K.tar',
  File "/db2/users/kyuhyunlee/git_repos/DeepDock/deepdock/utils/data.py", line 60, in __init__
    self.data = [i for i in self.data if not np.isnan(i[1].x.numpy().min())]
  File "/db2/users/kyuhyunlee/git_repos/DeepDock/deepdock/utils/data.py", line 60, in <listcomp>
    self.data = [i for i in self.data if not np.isnan(i[1].x.numpy().min())]
  File "/db2/users/kyuhyunlee/anaconda3/envs/py39_default/lib/python3.9/site-packages/torch_geometric/data/data.py", line 642, in x
    return self['x'] if 'x' in self._store else None
  File "/db2/users/kyuhyunlee/anaconda3/envs/py39_default/lib/python3.9/site-packages/torch_geometric/data/data.py", line 357, in __getattr__
    raise RuntimeError(
RuntimeError: The 'data' object was created by an older version of PyG. If this error occurred while loading an already existing dataset, remove the 'processed/' directory in the dataset's root folder and try again.

Is there any way to fix it? I already know DeepDock needs older versions of pytorch and pytorch_geometric as prerequisites, but it would be nice to use DeepDock with the latest versions of pytorch and pyg.

great work, do you have any plan to update?

Since torch has been updated, do you have any plan to update?

BTW, how can I train on my own data? Especially, how should I manage the training data?

Thanks

Large Scale Screening

Hi,

I have a question about large scale screening workflow.

Could you please suggest which would be the optimal usage of DeepDock for large scale screening?

Specifically, we have thousands of molecules stored in the same .mol2 file, to be screened against a single protein .pdb. If you could tell us which functions we have to run, and in what order, it would be really helpful.

Thank you.
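As a starting point for the multi-molecule case, the sketch below splits the text of a .mol2 file into one block per molecule, relying only on the fact that each record in the format starts with an `@<TRIPOS>MOLECULE` line. The function name `split_mol2` is hypothetical and this is not the authors' recommended workflow; each block could then be parsed (for example with RDKit's `Chem.MolFromMol2Block`) and scored one at a time against a single precomputed target .ply file.

```python
def split_mol2(text):
    """Split the contents of a multi-molecule .mol2 file into one string per molecule."""
    marker = "@<TRIPOS>MOLECULE"
    blocks = []
    current = None  # lines of the molecule being collected, None before the first record
    for line in text.splitlines(keepends=True):
        if line.startswith(marker):
            if current is not None:
                blocks.append("".join(current))
            current = [line]
        elif current is not None:
            current.append(line)
        # Lines before the first marker (e.g. comments) are skipped.
    if current is not None:
        blocks.append("".join(current))
    return blocks
```

Because the target surface does not change between ligands, preparing it once and reusing it for every block avoids repeating the most expensive preprocessing step.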

Wrong RMSD calculation?

I noticed there is a problem with the docking pose generation script you provided, Docking_CASF2016_CoreSet.ipynb.

In dock_compound function, there is a line
result['rmsd'] = Chem.rdMolAlign.AlignMol(opt_mol, real_mol, atomMap=list(zip(opt.noHidx,opt.noHidx))).

This means you are aligning opt_mol to real_mol and then calculating the RMSD. When I delete this line, the docking poses become much worse than before. This operation is only used in molecule generation tasks, not in docking tasks, because it neglects the rotation and translation errors of the ligand.
You should write the conformer to a file and then use obrms (from Open Babel) to calculate a reliable RMSD.

I reproduced your experiment. Using your code directly, the success rate is about 62%, as reported.
After I removed the alignment line and reran the experiments, the success rate dropped to 41% (percentage of docking poses with RMSD < 2 Å compared to the crystal structure).
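The difference between the two measurements can be illustrated with a toy example in plain Python. The coordinates are hypothetical, and the "superposition" here only removes translation by centring both structures, whereas a full alignment such as `AlignMol` also optimizes rotation. The point is only to show why an RMSD computed after superposition can be much smaller than one computed in place.

```python
import math

def rmsd(a, b):
    """Root-mean-square deviation between two equally ordered coordinate lists."""
    squared = sum((p - q) ** 2 for u, v in zip(a, b) for p, q in zip(u, v))
    return math.sqrt(squared / len(a))

def centered(coords):
    """Translate coordinates so that their centroid sits at the origin."""
    n = len(coords)
    centroid = [sum(point[i] for point in coords) / n for i in range(3)]
    return [tuple(point[i] - centroid[i] for i in range(3)) for point in coords]

# Hypothetical three-atom ligand: the "docked" pose has the right shape
# but sits 5 Å away from the "crystal" pose along x.
crystal = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
docked = [(x + 5.0, y, z) for x, y, z in crystal]

in_place = rmsd(docked, crystal)                                  # 5.0 Å
after_superposition = rmsd(centered(docked), centered(crystal))   # ~0 Å
```

A pose that is misplaced in the pocket but has the correct internal geometry scores well after superposition and poorly in place, which is the effect described in the report above.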

How to generate a ply file?

Is there any method to specify binding site in script?

Hi,

Is there any method to specify binding site in script?
The resulting pose binds at a different site from the one in the original protein-ligand complex structure.

JeongSoo Na

Charge file not generated/found

FileNotFoundError: [Errno 2] No such file or directory: '1z6e_protein_temp_out.csv'

Running the cell with compute_inp_surface(target_filename, ligand_filename, dist_threshold=10) in the Docking_example.ipynb throws this error.

Is there something that needs to be done with the multivalue binary in ABPS to return the tmp_file_base+"_out.csv"?

The MULTIVALUE_BIN environment variable points to the exact path of the multivalue binary, and all the steps prior to this one work and produce their output files, like tmp_file_base + ".csv".

Also, may I ask what these three lines of code are actually doing?
multivalue = multivalue_bin + " %s %s %s"
make_multivalue = multivalue % (tmp_file_base+".csv", tmp_file_base+".dx", tmp_file_base+"_out.csv")
os.system(make_multivalue)

Thank you.
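On the three quoted lines: they build a shell command string with `%` formatting and pass it to `os.system`. The sketch below reproduces only the string construction so the result can be inspected. The binary path is a hypothetical placeholder, and `tmp_file_base` is inferred from the file name in the error message.

```python
import shlex

multivalue_bin = "/path/to/multivalue"   # hypothetical stand-in for $MULTIVALUE_BIN
tmp_file_base = "1z6e_protein_temp"      # inferred from '1z6e_protein_temp_out.csv'

# Template with three placeholders: input .csv, input .dx, output .csv.
multivalue = multivalue_bin + " %s %s %s"
make_multivalue = multivalue % (
    tmp_file_base + ".csv",
    tmp_file_base + ".dx",
    tmp_file_base + "_out.csv",
)
# make_multivalue ==
# "/path/to/multivalue 1z6e_protein_temp.csv 1z6e_protein_temp.dx 1z6e_protein_temp_out.csv"

# The same command as an argument list, e.g. for subprocess.run(args, check=True).
args = shlex.split(make_multivalue)
```

`os.system` returns the exit status instead of raising an exception, so if the binary fails (for instance because the `.dx` input is missing) the script continues and only fails later, when it tries to open the `_out.csv` file that was never written. Running the printed command by hand, or using `subprocess.run(args, check=True)`, would surface the underlying error.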

MSMS error

I have set the path to MSMS as
export MSMS_BIN=/,,,/MSMS
but I got an error. Could you please help me solve it?

Error while building docker container

Hi,

I've tried to build the docker container and I get this error :

Step 22/26 : RUN ["wget", "-O", "reduce.gz", "http://kinemage.biochem.duke.edu/php/downlode-3.php?filename=/../downloads/software/reduce31/reduce.3.23.130521.linuxi386.gz"]
---> Running in d2006b089496
--2021-12-07 13:27:22-- http://kinemage.biochem.duke.edu/php/downlode-3.php?filename=/../downloads/software/reduce31/reduce.3.23.130521.linuxi386.gz Resolving kinemage.biochem.duke.edu (kinemage.biochem.duke.edu)... 40.76.186.240
Connecting to kinemage.biochem.duke.edu (kinemage.biochem.duke.edu)|40.76.186.240|:80... connected. HTTP request sent, awaiting response... 404 Not Found 2021-12-07 13:27:22 ERROR 404: Not Found.

The command 'wget -O reduce.gz http://kinemage.biochem.duke.edu/php/downlode-3.php?filename=/../downloads/software/reduce31/reduce.3.23.130521.linuxi386.gz' returned a non-zero code: 8
Looking forward to test deepdock ^^

Screening power does not match

Hi,
Could you please release your Score_decoys_screening_CASF2016.csv file? I tried your checkpoint with your screening power calculation notebook scripts but cannot get the same result as shown in your paper. The code I used for the screening success rate is from CASF-2016, and I checked it against the example data (e.g. AutoDock Vina), for which the results are correct.
Best

batch_size ->8, then show the error:

When I changed the batch_size to 8, I got:
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)

Model Retraining

Hi,

Great work!

I have a question about the model retraining.

To retrain the model on a new dataset, how should the data be structured? In its current state, DeepDock loads an entire .tar archive (which I suppose contains protein and ligand structures) for training. However, we did not manage to decompress it, so it is unclear how the files should be organised for training the algorithm de novo (i.e., on a new training set).

Thank you!

What is the purpose of sanitize and cleanupSubstructures in the RDKit function?

What is the purpose of sanitize and cleanupSubstructures in the RDKit function? What happens if I set sanitize=True, cleanupSubstructures=True?

real_mol = Chem.MolFromMol2File('1z6e_ligand.mol2',sanitize=False, cleanupSubstructures=False)

how to represent the ligand as a vector of the Euler angles, the relative position of the ligand?

hi, @omendezlucio
I am interested in representing the ligand as a vector of the Euler angles and the relative position of the ligand. I am confused about it and didn't find the code in your project. Could you explain the principle behind this process, or share some code for it?

Ply Files

Hi,

Once again, great work!

I have a question about the ply files used by DeepDock.

We noticed that inside the data folder there are ".ply" files. We did not understand whether these files are computed on the fly during DeepDock scoring, or if they should be computed separately before scoring. Could you elaborate?

Thank you!

installation and examples

hi ,

i downloaded the Docker image as in:
docker pull omendezlucio/deepdock
docker run -it omendezlucio/deepdock:latest
then copy/paste the source from
https://github.com/OptiMaL-PSE-Lab/DeepDock/blob/main/examples/Score_example.ipynb
into a python script:

from rdkit import Chem
import deepdock
from deepdock.models import *
from deepdock.DockingFunction import score_compound
from deepdock.DockingFunction import calculate_atom_contribution
import numpy as np
import torch
np.random.seed(123)
torch.cuda.manual_seed_all(123)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
ligand_model = LigandNet(28, residual_layers=10, dropout_rate=0.10)
target_model = TargetNet(4, residual_layers=10, dropout_rate=0.10)
model = DeepDock(ligand_model, target_model, hidden_dim=64, n_gaussians=10, dropout_rate=0.10, dist_threhold=7.).to(device)
checkpoint = torch.load(deepdock.__path__[0]+'/../Trained_models/DeepDock_pdbbindv2019_13K_minTestLoss.chk', map_location=torch.device(device))
model.load_state_dict(checkpoint['model_state_dict'])
target_ply = deepdock.__path__[0]+'/../data/1z6e_protein.ply'
real_mol = Chem.MolFromMol2File(deepdock.__path__[0]+'/../data/1z6e_ligand.mol2',sanitize=False, cleanupSubstructures=False)
score = score_compound(real_mol, target_ply, model, dist_threshold=3., seed=123, device=device)
score

I called the script test.py and said:
python3 test.py

this runs for a few seconds, without any errors or warnings, and then exits without giving me any output,
shouldn't the final command (score) print out the score to stdout?
talking about commands ... are there any plans to make a documentation for deepdock?
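On the missing output: a notebook or interactive session echoes the value of the last expression in a cell, but a script run with `python3 test.py` does not, so a bare `score` line is evaluated and discarded. The sketch below demonstrates the difference by executing two tiny scripts and capturing their standard output; the value 0.5 is a placeholder, not a real DeepDock score.

```python
import contextlib
import io

def run_and_capture(source):
    """Execute `source` as a script and return whatever it wrote to stdout."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(compile(source, "<script>", "exec"), {})
    return buffer.getvalue()

silent = run_and_capture("score = 0.5\nscore\n")        # bare expression: nothing printed
shown = run_and_capture("score = 0.5\nprint(score)\n")  # explicit print: "0.5\n"
```

Replacing the final `score` line of the script with `print(score)` should therefore make the result appear on stdout.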

when inside docker i cannot execute the jupyter notebook, as there is no browser in this image.
therefore i tried to install from source as outlined on the webpage.
I say:

git clone https://github.com/OptiMaL-PSE-Lab/DeepDock.git
cd DeepDock/
conda create --name mydd
conda activate mydd
git submodule update --init --recursive
conda install -c conda-forge rdkit=2019.09.1
pip install -r requirements.txt

the last command starts running, and then i get:
[...]
Looking in links: https://pytorch-geometric.com/whl/torch-1.4.0.html, https://pytorch-geometric.com/whl/torch-1.4.0.html, https://pytorch-geometric.com/whl/torch-1.4.0.html, https://pytorch-geometric.com/whl/torch-1.4.0.html
Collecting torch==1.4.0
Using cached torch-1.4.0-cp38-cp38-manylinux1_x86_64.whl (753.4 MB)
Collecting torch-scatter==2.0.4+cu101
Using cached https://data.pyg.org/whl/torch-1.4.0/torch_scatter-2.0.4%2Bcu101-cp38-cp38-linux_x86_64.whl (10.6 MB)
Discarding https://data.pyg.org/whl/torch-1.4.0/torch_scatter-2.0.4%2Bcu101-cp38-cp38-linux_x86_64.whl (from https://pytorch-geometric.com/whl/torch-1.4.0.html): Requested torch-scatter==2.0.4+cu101 from https://data.pyg.org/whl/torch-1.4.0/torch_scatter-2.0.4%2Bcu101-cp38-cp38-linux_x86_64.whl (from -r requirements.txt (line 3)) has inconsistent version: filename has '2.0.4+cu101', but metadata has '2.0.4'
ERROR: Could not find a version that satisfies the requirement torch-scatter==2.0.4+cu101 (from versions: latest+cpu, latest+cu92, latest+cu100, latest+cu101, 0.3.0, 1.0.2, 1.0.3, 1.0.4, 1.1.0, 1.1.1, 1.1.2, 1.2.0, 1.3.0, 1.3.1, 1.3.2, 1.4.0, 2.0.2, 2.0.3, 2.0.3+cpu, 2.0.3+cu100, 2.0.3+cu101, 2.0.3+cu92, 2.0.4, 2.0.4+cpu, 2.0.4+cu100, 2.0.4+cu101, 2.0.4+cu92, 2.0.5, 2.0.6, 2.0.7, 2.0.8, 2.0.9)
ERROR: No matching distribution found for torch-scatter==2.0.4+cu101

any suggestions on how to deal with that?
(my system is Ubuntu 20.04.4)

thanks!
michael

Error when using another Ligand

I tried to dock different ligands with the proteins in the repo. Following the files in your examples, docking with 2br1_protein.pdb and 2wtv_protein.pdb worked, but docking with 1z6e_protein.pdb failed.
And error msg:

Traceback (most recent call last):
  File "test.py", line 7, in <module>
    compute_inp_surface(target_filename, ligand_filename, dist_threshold=10)
  File "/DeepDock/deepdock/prepare_target/computeTargetMesh.py", line 70, in compute_inp_surface
    structure = structures[0] # 'structures' may contain several proteins in this case only one.
  File "/opt/conda/lib/python3.6/site-packages/Bio/PDB/Entity.py", line 45, in __getitem__
    return self.child_dict[id]
KeyError: 0

It seems that something goes wrong when generating 1z6e_protein_15A.pdb.
And my Ligand file:
new1.mol2.zip
Thank you!
