deeprank / iscore

iScore: an MPI supported software for ranking protein-protein docking models based on a random walk graph kernel and support vector machines

License: Apache License 2.0

Python 90.29% Makefile 0.79% Cuda 8.40% C++ 0.52%

iscore's Introduction

⚠️ Archiving Note

This repository is no longer being maintained and has been archived for historical purposes.

We have now developed DeepRank2, an improved and unified version of DeepRank, DeepRank-GNN, and DeepRank-Mut.

✨ DeepRank2 allows for transformation and storage of 3D representations of both protein-protein interfaces (PPIs) and protein single-residue variants (SRVs) into either graphs or volumetric grids containing structural and physico-chemical information. These can be used for training neural networks for a variety of patterns of interest, using either our pre-implemented training pipeline for graph neural networks (GNNs) or convolutional neural networks (CNNs) or external pipelines.

🔧 Pull Requests at github.com/DeepRank/deeprank2/pulls
🐛 Bugs: Reports of bugs can be filed against our new repo at github.com/DeepRank/deeprank2/issues
⭐ Feature Requests: Add your request or discuss the project w/ the community at github.com/DeepRank/deeprank2/issues

We look forward to seeing you in our new space - DeepRank2!

DeepRank

Overview
Installation
Quick Tutorial
Documentation
License
Issues & Contributing

Overview

DeepRank is a general, configurable deep learning framework for data mining protein-protein interactions (PPIs) using 3D convolutional neural networks (CNNs).

DeepRank contains useful APIs for pre-processing PPIs data, computing features and targets, as well as training and testing CNN models.

Features:

Predefined atom-level and residue-level PPI feature types
- e.g. atomic density, vdw energy, residue contacts, PSSM, etc.
Predefined target types
- e.g. binary class, CAPRI categories, DockQ, RMSD, FNAT, etc.
Flexible definition of both new features and targets
3D grid feature mapping
Efficient data storage in HDF5 format
Supports both classification and regression (based on PyTorch)

Installation

DeepRank requires Python 3.7 or 3.8 on Linux or macOS. Make sure that mpi4py is installed in your environment before installing deeprank: conda install mpi4py

Stable Release

DeepRank is available in stable releases on PyPI:

Install the module: pip install deeprank

Development Version

You can also install the source code under development from the development branch:

Clone the repository: git clone --branch development https://github.com/DeepRank/deeprank.git
Go into it: cd deeprank
Install the package: pip install -e ./

To check that the installation was successful, you can run the test suite:

Go into the test directory: cd test
Run the test suite: pytest

Tutorial

This section gives a tutorial-style introduction to the DeepRank machinery. More information can be found in the documentation at http://deeprank.readthedocs.io/en/latest/. We quickly illustrate here the two main steps of DeepRank:

the generation of the data
running deep learning experiments.

A . Generate the data set (using MPI)

Generating the data requires only the PDB files of the decoys and their native structures, plus the PSSMs if needed. All the features and targets, as well as the features mapped onto grid points, are automatically calculated and stored in an HDF5 file.

from deeprank.generate import *
from mpi4py import MPI

comm = MPI.COMM_WORLD

# let's put this sample script in the test folder, so the working path will be deeprank/test/
# name of the hdf5 to generate
h5file = './hdf5/1ak4.hdf5'

# for each hdf5 file where to find the pdbs
pdb_source = ['./1AK4/decoys/']


# where to find the native conformations
# pdb_native is only used to calculate i-RMSD, dockQ and so on.
# The native pdb files will not be saved in the hdf5 file
pdb_native = ['./1AK4/native/']


# where to find the pssm
pssm_source = './1AK4/pssm_new/'


# initialize the database
database = DataGenerator(
    chain1='C', chain2='D',
    pdb_source=pdb_source,
    pdb_native=pdb_native,
    pssm_source=pssm_source,
    data_augmentation=0,
    compute_targets=[
        'deeprank.targets.dockQ',
        'deeprank.targets.binary_class'],
    compute_features=[
        'deeprank.features.AtomicFeature',
        'deeprank.features.FullPSSM',
        'deeprank.features.PSSM_IC',
        'deeprank.features.BSA',
        'deeprank.features.ResidueDensity'],
    hdf5=h5file,
    mpi_comm=comm)


# create the database
# compute features/targets for all complexes
database.create_database(prog_bar=True)


# define the 3D grid
grid_info = {
    'number_of_points': [30, 30, 30],
    'resolution': [1., 1., 1.],
    'atomic_densities': {'C': 1.7, 'N': 1.55, 'O': 1.52, 'S': 1.8},
}

# Map the features
database.map_features(grid_info,try_sparse=True, time=False, prog_bar=True)

This script can be executed using, for example, 4 MPI processes with the command:

    NP=4
    mpiexec -n $NP python generate.py

In the first part of the script we define the paths to the PDB files of the decoys and natives that we want in the dataset. All the .pdb files present in pdb_source will be used. We need to specify where to find the native conformations in order to compute the RMSD and the DockQ score. For each PDB file detected in pdb_source, the code will try to find a native conformation in pdb_native.

We then initialize the DataGenerator object. This object (defined in deeprank/generate/DataGenerator.py) needs a few input parameters:

pdb_source: where to find the pdb to include in the dataset
pdb_native: where to find the corresponding native conformations
compute_targets: list of modules used to compute the targets
compute_features: list of modules used to compute the features
hdf5: Name of the HDF5 file to store the data set

We then create the database with the command database.create_database(). This function automatically creates an HDF5 file in which each PDB has its own group. In each group we can find the PDB of the complex and its native form, the calculated features and the calculated targets. We can now map the features to a grid. This is done via the command database.map_features(). As you can see, this method requires a dictionary as input. The dictionary contains the instructions for mapping the data:

number_of_points: the number of points in each direction
resolution: the grid resolution in angstroms (Å)
atomic_densities: {'atom_name': vdw_radius}, the atomic densities required

The atomic densities are mapped following the protein-ligand paper. The other features are mapped to the grid points using a Gaussian function (other modes are possible but are currently hard-coded).
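As a rough illustration of Gaussian mapping, the sketch below spreads one feature value located at a single point onto the nodes of a regular grid. It is not DeepRank's own implementation: the function name map_gaussian, the width sigma and the choice of a grid centered on the origin are assumptions made for this example only.

```python
import math

def map_gaussian(center, value, number_of_points, resolution, sigma=1.0):
    """Spread `value`, located at `center` (x, y, z), onto a regular grid.

    The grid is centered on the origin. `number_of_points` and `resolution`
    mirror the grid_info keys; `sigma` is a hypothetical Gaussian width.
    """
    # Coordinates of the grid nodes along each axis, centered on 0
    axes = []
    for n, step in zip(number_of_points, resolution):
        half = (n - 1) * step / 2.0
        axes.append([i * step - half for i in range(n)])

    grid = {}
    for ix, x in enumerate(axes[0]):
        for iy, y in enumerate(axes[1]):
            for iz, z in enumerate(axes[2]):
                # Squared distance between the grid node and the feature
                d2 = ((x - center[0]) ** 2
                      + (y - center[1]) ** 2
                      + (z - center[2]) ** 2)
                grid[(ix, iy, iz)] = value * math.exp(-d2 / (2.0 * sigma ** 2))
    return grid

grid = map_gaussian(center=(0.0, 0.0, 0.0), value=2.0,
                    number_of_points=[3, 3, 3], resolution=[1.0, 1.0, 1.0])
print(grid[(1, 1, 1)])  # node that coincides with the feature position
print(grid[(0, 1, 1)])  # neighboring node, one grid step away
```

The node that coincides with the feature receives the full value, and the contribution decays with distance, which is why a finer resolution or a larger sigma changes how smooth the mapped feature looks.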

Visualization of the mapped features

To explore the HDF5 file and visualize the features, you can use the dedicated browser https://github.com/DeepRank/DeepXplorer. This tool allows you to dig through the HDF5 file and directly generate the files required to visualize the features in VMD or PyMol. An IPython console is also embedded to analyze the feature values, plot them, etc.

B . Deep Learning

The HDF5 files generated above can be used as input for deep learning experiments. You can take a look at the file test/test_learn.py for some examples. We give here a quick overview of the process.

from deeprank.learn import *
from deeprank.learn.model3d import cnn_reg
import torch.optim as optim
import numpy as np

# input database
database = '1ak4.hdf5'

# output directory
out = './my_DL_test/'

# declare the dataset instance
data_set = DataSet(database,
            chain1='C',
            chain2='D',
            grid_info={
                'number_of_points': (10, 10, 10),
                'resolution': (3, 3, 3)},
            select_feature={
                'AtomicDensities': {'C': 1.7, 'N': 1.55, 'O': 1.52, 'S': 1.8},
                'Features': ['coulomb', 'vdwaals', 'charge', 'PSSM_*']},
            select_target='DOCKQ',
            normalize_features = True, normalize_targets=True,
            pair_chain_feature=np.add,
            dict_filter={'DOCKQ':'<1'})


# create the network
model = NeuralNet(data_set,cnn_reg,model_type='3d',task='reg',
                  cuda=False,plot=True,outdir=out)

# change the optimizer (optional)
model.optimizer = optim.SGD(model.net.parameters(),
                            lr=0.001,
                            momentum=0.9,
                            weight_decay=0.005)

# start the training
model.train(nepoch = 50,divide_trainset=0.8, train_batch_size = 5,num_workers=0)

In the first part of the script we create a Torch database from the HDF5 file. We can specify one or several HDF5 files and even select some conformations using the dict_filter argument. Other options of DataSet can be used to specify the features and targets, the normalization, etc.

We then create a NeuralNet instance that takes the dataset as input argument. Several options are available to specify the task, the use of a GPU, etc. All that remains is to train the model.
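To make the idea behind dict_filter concrete, here is a small sketch of how a condition string such as '<1' could be evaluated against stored target values. This is not how DeepRank implements the filter: the helper passes_filter and the example target values are hypothetical and only illustrate the selection logic.

```python
import operator

# Comparison operators recognised by this sketch (two-character symbols first)
_OPS = [('<=', operator.le), ('>=', operator.ge), ('==', operator.eq),
        ('<', operator.lt), ('>', operator.gt)]

def passes_filter(targets, dict_filter):
    """Return True if every condition in dict_filter holds for `targets`."""
    for name, condition in dict_filter.items():
        condition = condition.strip()
        for symbol, compare in _OPS:
            if condition.startswith(symbol):
                threshold = float(condition[len(symbol):])
                if not compare(targets[name], threshold):
                    return False
                break
        else:
            # No known operator matched the start of the condition
            raise ValueError(f"Unsupported condition: {condition!r}")
    return True

# Hypothetical target values for three conformations
conformations = {
    'decoy_1': {'DOCKQ': 0.12},
    'decoy_2': {'DOCKQ': 0.85},
    'native': {'DOCKQ': 1.0},
}
selected = [name for name, targets in conformations.items()
            if passes_filter(targets, {'DOCKQ': '<1'})]
print(selected)
```

With the filter {'DOCKQ': '<1'}, the conformation whose DockQ equals 1 is left out, which matches the intent of excluding the native structure from training.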

Issues and Contributing

If you have questions or find a bug, please report the issue in the GitHub issue channel.

If you want to change or further develop DeepRank code, please check the Developer Guideline to see how to conduct further development.

iscore's Issues

Could we make sure that GraphRank score is the lower the better

In the GraphRank.dat file, the header currently reads: #Name label pred decision_value

Could we report #Name label pred decision_value GraphRank_score instead, where we make sure that a lower GraphRank_score is better? In the current paper, we use “native is labeled as class 1 and non-native 0, positive decision_value is for class 1”.
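One simple way to obtain a lower-is-better score is to negate decision_value, since a positive decision_value favors the native class. The sketch below appends such a column to lines in the GraphRank.dat layout. Using the negated decision value as GraphRank_score is an assumption made for illustration, not a decision recorded in this thread, and the function name and sample rows are hypothetical.

```python
def add_graphrank_score(lines):
    """Append a GraphRank_score column equal to -decision_value."""
    out = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        if line.startswith('#'):
            # Header line: add the name of the new column
            out.append(line.rstrip() + ' GraphRank_score')
            continue
        name, label, pred, decision_value = line.split()
        score = -float(decision_value)  # lower score means more native-like
        out.append(f"{name} {label} {pred} {decision_value} {score:.6f}")
    return out

example = [
    '#Name label pred decision_value',
    'model_1 1 1 0.75',    # hypothetical near-native model
    'model_2 0 0 -1.20',   # hypothetical non-native model
]
rows = add_graphrank_score(example)
for row in rows:
    print(row)
```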

fail to repeat 684 dataset result

Describe the bug
When I run iScore.predict on one of the data sets, such as 1ACB, I get iScores that differ from the provided ones. In the 684 dataset, the reference file is 1ACB.iScore. In the output of iScore.predict, the corresponding file is iScorePredict.dat

Environment:

OS system: Linux VM-88-194-centos 3.10.107-1-tlinux2_kvm_guest-0049
Version:
Branch commit ID: latest
Inputs: the same pdb and pssm in 684

To Reproduce
Steps/commands to reproduce the behaviour:

iScore.predict

Expected Results
Ideally, I want to get the same result files which were provided in 684 datasets including haddock energy scores.

About the outputs

Hi, thanks for the great work! I am wondering which results are the final iScore. Is it the last column in ScorePredict.dat? Many thanks :)

h5x support does not work on windows

Describe the bug
On Windows, the path of each pdb generated in loadData.py should contain double backslashes : C:\Users\Name\ ..... instead of single ones. Also, the paths in example/h5x/graphs.hdf5 are local to my own machine ...

Environment:

OS system: Windows
Version: latest
Branch commit ID: master

To Reproduce
Steps/commands to reproduce the behaviour:

create the hdf5 graph with example/kernel/create_kernel.py
launch h5x : bin/iScore.h5x
right click on mol and select PyMol

Expected Results
A nice PyMol viz of the interface

Actual Results or Error Info
Unicode error due to the single \ when loading the pdbs
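A possible workaround, sketched below, is to convert Windows paths to forward slashes before handing them to another program. This assumes that the error comes from a sequence such as \U in C:\Users being read as the start of a Unicode escape. The helper name pymol_safe_path is hypothetical, and the sketch does not reproduce the code in loadData.py.

```python
from pathlib import PureWindowsPath

def pymol_safe_path(path):
    """Return `path` with forward slashes, which avoids backslash escapes."""
    return PureWindowsPath(path).as_posix()

# Raw string, so that Python itself does not interpret the backslashes
pdb_path = r'C:\Users\Name\graphs\model_1.pdb'
safe_path = pymol_safe_path(pdb_path)
print(safe_path)
```

Forward slashes are accepted in file paths on Windows by Python, so the converted path can still be opened there while no longer containing any backslash.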

need the PSSM class to allow input sequence

The current PSSM class takes a pdb file as input and extracts sequences from the pdb file. However, many pdb files do not have a complete sequence (missing segments) and this will affect the quality of pssm.

Can we make the pssm class accept user-supplied sequences?

iScore normalization needs to be checked

iScore normalization:

(X-median)/IQR
median and IQR are calculated within a case (all models from a single case)
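The normalization described above can be sketched as follows, using only the standard library. The quartile convention (the "inclusive" method of statistics.quantiles) is an assumption, since the thread does not say how the IQR is computed, and the function names and sample scores are hypothetical.

```python
from statistics import median, quantiles

def normalize_case(scores):
    """Return (x - median) / IQR for all models of a single case."""
    q1, _, q3 = quantiles(scores, n=4, method='inclusive')
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero: all scores in this case are too similar")
    med = median(scores)
    return [(x - med) / iqr for x in scores]

def normalize_by_case(scores_by_case):
    """Normalize each case independently, as median and IQR are per case."""
    return {case: normalize_case(scores)
            for case, scores in scores_by_case.items()}

# Hypothetical raw scores for the models of two cases
raw = {
    'caseA': [1.0, 2.0, 3.0, 4.0, 5.0],
    'caseB': [10.0, 20.0, 30.0, 40.0, 50.0],
}
normalized = normalize_by_case(raw)
print(normalized['caseA'])
print(normalized['caseB'])
```

Because the median and IQR are computed within each case, the two cases above end up on the same scale even though their raw scores differ by a factor of ten.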

Dependabot couldn't authenticate with https://pypi.python.org/simple/

Dependabot couldn't authenticate with https://pypi.python.org/simple/.

You can provide authentication details in your Dependabot dashboard by clicking into the account menu (in the top right) and selecting 'Config variables'.

View the update logs.

install blast and its db in setup.py

There are 4 steps in total:

1. download psiblast
2. download the blast nr database (we can provide it so that the results are reproducible)
3. run psiblast to generate the pssm
4. reformat the pssm to a format that iscore uses

provide training_set.tar.gz

I could not find the training_set.tar.gz from BM5 dimers in our iscore repository. Could you please tell me where it is? Most users do not want to retrain iscore and simply want to use it

error when some pdb files cannot have graph generated

When some pdb files cannot have graph generated, we will have the error message below. In this situation, all the pdb files have haddock energy (Energy.dat) and some models are missing from graphrank file (GraphRank.dat).

Traceback (most recent call last):
File "/home/lixue/tools/iScore/bin/iScore.predict", line 63, i
iscore()
File "/home/lixue/tools/iScore/iScore/function/score.py", line
self.read_graphrank()
File "/home/lixue/tools/iScore/iScore/function/score.py", line
data.append(self.features[m]['grank'])
KeyError: 'grank'

This bug can be fixed by switching the order of the following two lines (quick and dirty):

From

    self.read_graphrank()
    self.read_energy()

To

    self.read_energy()
    self.read_graphrank()
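Beyond swapping the two calls, a more defensive variant could skip models that have no GraphRank value and report them. The sketch below illustrates this with a plain dictionary standing in for self.features. The function name collect_graphrank and the sample values are hypothetical, and this is not the code in score.py.

```python
def collect_graphrank(features):
    """Split models into those with a 'grank' value and those without."""
    data, missing = [], []
    for model, values in features.items():
        if 'grank' in values:
            data.append(values['grank'])
        else:
            missing.append(model)
    return data, missing

# Hypothetical content: model_2 has an energy but no graph-based score
features = {
    'model_1': {'energy': -12.3, 'grank': 0.4},
    'model_2': {'energy': -8.1},
}
data, missing = collect_graphrank(features)
if missing:
    print('No GraphRank value for:', ', '.join(missing))
```

Reporting the missing models explicitly tells the user which pdb files failed at the graph generation step instead of stopping with a KeyError.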

need proper error messages when pssm does not match pdb

When the pssm file does not match the pdb file, the K matrix is empty. Can we give a proper error message when the pssm does not match the pdb? Can we also give a proper error message when the K matrix is empty?
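A minimal sketch of such a check is shown below. It assumes that the kernel is held in a dictionary, as the traceback in a later issue suggests (K.keys()). The function name check_kernel, the file names and the wording of the message are hypothetical.

```python
def check_kernel(K, pdb_name, pssm_name):
    """Raise a descriptive error if the kernel dictionary K is empty."""
    if not K:
        raise ValueError(
            f"The K matrix is empty for {pdb_name}. A possible cause is that "
            f"the PSSM file {pssm_name} does not match the PDB file."
        )
    return K

try:
    check_kernel({}, '1ACB.pdb', '1ACB.pssm')
except ValueError as err:
    message = str(err)
    print(message)
```

Failing early with a message that names both files is easier to act on than an IndexError raised later, when the code indexes into an empty list of keys.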

Use PSSMGen or not?

The repo README mentions

The PSSM files can be calculated using PSSMGen https://github.com/DeepRank/PSSMGen.

I think it's nice to have an independent package for generating PSSM.
Then it is not necessary to keep the module pssm in iScore which is same as PSSMGen package.
@LilySnow @NicoRenaud What do you think?

Can we make pssm mapping to pdb as default in iScore

Can we make pssm mapping to pdb as default in iScore? Otherwise, it is very buggy...

openmpi or mpich

Does openmpi or mpich support Windows?

To update docs for code v0.3.4

Docs should be updated to be consistent with the code of version 0.3.4.

update the tutorial
update how to use commands and python modules for generating graph and kernel, training and test.
organise the code documentation

issues of iscore doc

I am following iScore workflow and got the error message below:

(base) [lixue@alcazar test]$ mpiexec -n 2 iScore.predict.mpi --archive ../train/training_set.tar.gz
Warning : pycuda not found
Warning : scikit-cuda not found
Warning : pycuda not found
Warning : scikit-cuda not found
Traceback (most recent call last):
File "/home/lixue/tools/iScore/bin/iScore.predict.mpi", line 81, in <module>
iscore_svm(load_model=None,package_name=args.archive,testID=args.ground_truth)
File "/home/lixue/tools/iScore/iScore/graphrank/rank.py", line 401, in iscore_svm
testdata = DataSet(trainID,Kfile,maxlen,testID=testID)
File "/home/lixue/tools/iScore/iScore/graphrank/rank.py", line 39, in __init__
self.get_K_matrix()
File "/home/lixue/tools/iScore/iScore/graphrank/rank.py", line 112, in get_K_matrix
key = list(K.keys())[1]
IndexError: list index out of range

the differences between python iscore and our original matlab code

According to Cunliang, here are the differences:

max walk length: 4 v.s. 3
SVM training parameters svm_parameter('-t 4 -c 4 -b 1') v.s. svm_parameter('-t 4 -c 1 -b 0')

how to uninstall iscore

I installed iscore in one location. Then I deleted it and wanted to install it in another location. But I got the following error message when running pip install -e ./:

(base) [lixue@alcazar iScore]$ pip install -e ./
Obtaining file:///home/lixue/tools/iScore
Requirement already satisfied: numpy>=1.13 in /home/lixue/tools/anaconda3/lib/python3.6/site-packages (from iScore==0.0) (1.16.2)
Requirement already satisfied: scipy in /home/lixue/tools/anaconda3/lib/python3.6/site-packages (from iScore==0.0) (1.2.1)
Installing collected packages: iScore
Found existing installation: iScore 0.0
Exception:
Traceback (most recent call last):
File "/home/lixue/tools/anaconda3/lib/python3.6/site-packages/pip/_internal/cli/base_command.py", line 179, in main
status = self.run(options, args)
File "/home/lixue/tools/anaconda3/lib/python3.6/site-packages/pip/_internal/commands/install.py", line 393, in run
use_user_site=options.use_user_site,
File "/home/lixue/tools/anaconda3/lib/python3.6/site-packages/pip/_internal/req/__init__.py", line 50, in install_given_reqs
auto_confirm=True
File "/home/lixue/tools/anaconda3/lib/python3.6/site-packages/pip/_internal/req/req_install.py", line 816, in uninstall
uninstalled_pathset = UninstallPathSet.from_dist(dist)
File "/home/lixue/tools/anaconda3/lib/python3.6/site-packages/pip/_internal/req/req_uninstall.py", line 505, in from_dist
'(at %s)' % (link_pointer, dist.project_name, dist.location)
AssertionError: Egg-link /data/lixue/capri/Capri47/T160/test_iscore/iScore does not match installed location of iScore (at /home/lixue/tools/iScore)
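The message suggests that an egg-link file left behind by the first editable install still points at the deleted location. As a hedged sketch, the helper below lists egg-link files whose target directory no longer exists. The function name find_stale_egg_links is hypothetical, and the directory scanned in the demonstration is a temporary folder standing in for site-packages.

```python
import tempfile
from pathlib import Path

def find_stale_egg_links(site_dir):
    """Map each *.egg-link in `site_dir` with a missing target to that target."""
    stale = {}
    for link in sorted(Path(site_dir).glob('*.egg-link')):
        lines = link.read_text().splitlines()
        # The first line of an egg-link file holds the project location
        target = lines[0].strip() if lines else ''
        if not target or not Path(target).is_dir():
            stale[link.name] = target
    return stale

# Demonstration on a temporary directory standing in for site-packages
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, 'iScore.egg-link').write_text('/path/that/no/longer/exists\n.')
    stale = find_stale_egg_links(tmp)
print(stale)
```

On a real installation, site.getsitepackages() returns candidate directories to scan. Whether to delete a stale file by hand is left to the user.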

UnpicklingError: could not find MARK

Describe the bug
UnpicklingError: could not find MARK occurs when attempting to run iScore.predict using the trained model.

Environment:

OS system: Ubuntu
Version: 20.04
Branch commit ID:
Inputs:

To Reproduce
Steps/commands to reproduce the behaviour:

created pdb and pssm folders under iScore/test directory
typed either iScore.predict or mpiexec -n iScore.predict.mpi within the test directory (both gave the same error)

Expected Results
A. Expected the creation of the iScore.predict.txt file

Actual Results or Error Info
nick@nick-HP-ZBook-15-G5:~/iScore/test$ iScore.predict
/home/nick/.local/lib/python3.8/site-packages/skcuda/cublas.py:284: UserWarning: creating CUBLAS context to get version number
warnings.warn('creating CUBLAS context to get version number')
Reusing graphs in ./graph/
Reusing kernels in ./kernel/
Traceback (most recent call last):
File "/home/nick/.local/bin/iScore.predict", line 7, in <module>
exec(compile(f.read(), __file__, 'exec'))
File "/home/nick/iScore/bin/iScore.predict", line 68, in <module>
iscore_svm(load_model=None,package_name=args.archive,maxlen=args.maxlen,testID=args.ground_truth)
File "/home/nick/iScore/iScore/graphrank/rank.py", line 400, in iscore_svm
testdata = DataSet(trainID,Kfile,maxlen,testID=testID)
File "/home/nick/iScore/iScore/graphrank/rank.py", line 39, in __init__
self.get_K_matrix()
File "/home/nick/iScore/iScore/graphrank/rank.py", line 104, in get_K_matrix
K.update(pickle.load(open(f,'rb')))
_pickle.UnpicklingError: could not find MARK
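Since the traceback ends in pickle.load on a file from the reused kernel folder, one way to narrow the problem down is to test which files in that folder can be unpickled at all. The sketch below does this on a temporary directory. The function name find_unreadable_pickles is hypothetical, and the sketch should only be pointed at files you created yourself, because unpickling untrusted data is unsafe.

```python
import pickle
import tempfile
from pathlib import Path

def find_unreadable_pickles(directory):
    """Return the names of files in `directory` that pickle cannot load."""
    bad = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        try:
            with open(path, 'rb') as handle:
                pickle.load(handle)
        except (pickle.UnpicklingError, EOFError, AttributeError,
                ImportError, IndexError):
            # Exceptions the pickle documentation lists for damaged input
            bad.append(path.name)
    return bad

# Demonstration with one valid and one corrupted file
with tempfile.TemporaryDirectory() as tmp:
    with open(Path(tmp, 'good.pkl'), 'wb') as handle:
        pickle.dump({'K': [1.0]}, handle)
    Path(tmp, 'broken.pkl').write_bytes(b'\x00\x01 corrupted')
    unreadable = find_unreadable_pickles(tmp)
print(unreadable)
```

If a kernel file turns out to be unreadable, deleting the kernel folder so that it is recomputed is the approach another issue in this list reports using.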

Additional Context
Example of pdb and pssm files attached.
1084_TS029_1o.zip

need pssm mapping code to be more flexible

Many times, we just need to calculate pssm once (time-consuming) and map it to all the pdb files (fast). The current code does not seem to allow it.

a bug with "mpiexec -n ${NPROC} iScore.predict"

When I ran "mpiexec -n ${NPROC} iScore.predict" on CAPRI target 161, I ran into a bug.

When I run this command for the first time, I get the following error message:

Reusing kernels in ./kernel/
Traceback (most recent call last):
File "/home/lixue/tools/iScore/bin/iScore.predict", line 57, in <module>
iscore_svm(load_model=None,package_name=args.archive,testID=args.ground_truth)
File "/home/lixue/tools/iScore/iScore/graphrank/rank.py", line 402, in iscore_svm
testdata = DataSet(trainID,Kfile,maxlen,testID=testID)
File "/home/lixue/tools/iScore/iScore/graphrank/rank.py", line 29, in __init__
self.test_name, self.test_class = self._get_ids(testID)
File "/home/lixue/tools/iScore/iScore/graphrank/rank.py", line 80, in _get_ids
nc = len(idlist[0])
IndexError: list index out of range

The 'graph' folder is generated successfully, but the 'kernel' folder is not.

I then deleted the 'kernel' folder and reran the same command. This time the kernel folder was generated correctly, but I got the following error message:

Traceback (most recent call last):
File "/home/lixue/tools/iScore/bin/iScore.predict", line 63, in
iscore()
File "/home/lixue/tools/iScore/iScore/function/score.py", line 17, in __init__
self.read_graphrank()
File "/home/lixue/tools/iScore/iScore/function/score.py", line 56, in read_graphran
data.append(self.features[m]['grank'])
KeyError: 'grank'

Then I ran the same command the 3rd time, iScorePredict.dat is finally generated.

The data is on alcazar: /home/lixue/CAPRI/Capri48/t161/iscore/before_cleaning/results_li

the pssm class hard-coded the file structure

The pssm class requires the following folder structure:

--working_dir
   |
    --caseID
        |
         -- pdb
             |
             --- pdb files

We can either write this requirement in the doc or change the code to be more flexible.
Which option should we go for?
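Whichever option is chosen, a short check of the expected layout can turn a confusing failure into a clear message. The sketch below verifies that working_dir/caseID/pdb exists and contains pdb files. The function name check_layout is hypothetical, and the demonstration builds the layout in a temporary directory.

```python
import tempfile
from pathlib import Path

def check_layout(working_dir, case_id):
    """Return the PDB files found under working_dir/case_id/pdb/."""
    pdb_dir = Path(working_dir) / case_id / 'pdb'
    if not pdb_dir.is_dir():
        raise FileNotFoundError(f"Expected directory {pdb_dir} is missing")
    pdb_files = sorted(pdb_dir.glob('*.pdb'))
    if not pdb_files:
        raise FileNotFoundError(f"No .pdb files found in {pdb_dir}")
    return pdb_files

# Demonstration: build the expected layout for a case named 1ACB
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / '1ACB' / 'pdb').mkdir(parents=True)
    (Path(tmp) / '1ACB' / 'pdb' / 'model_1.pdb').write_text('END\n')
    found = [p.name for p in check_layout(tmp, '1ACB')]
print(found)
```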

let iscore report problematic pdb files when they do not match the pssm file

Could we let iscore report problematic pdb files when the pssm files give too many warnings (missing residue warnings)?
