
deeprank / iscore


iScore: MPI-enabled software for ranking protein-protein docking models based on a random walk graph kernel and support vector machines

License: Apache License 2.0


iscore's Introduction

⚠️ Archiving Note

This repository is no longer being maintained and has been archived for historical purposes.

We have now developed DeepRank2, an improved and unified version of DeepRank, DeepRank-GNN, and DeepRank-Mut.

✨ DeepRank2 allows for transformation and storage of 3D representations of both protein-protein interfaces (PPIs) and protein single-residue variants (SRVs) into either graphs or volumetric grids containing structural and physico-chemical information. These can be used for training neural networks for a variety of patterns of interest, using either our pre-implemented training pipeline for graph neural networks (GNNs) or convolutional neural networks (CNNs) or external pipelines.

We look forward to seeing you in our new space - DeepRank2!

DeepRank


Overview


DeepRank is a general, configurable deep learning framework for data mining protein-protein interactions (PPIs) using 3D convolutional neural networks (CNNs).

DeepRank contains useful APIs for pre-processing PPI data, computing features and targets, as well as training and testing CNN models.

Features:

  • Predefined atom-level and residue-level PPI feature types
    • e.g. atomic density, vdw energy, residue contacts, PSSM, etc.
  • Predefined target types
    • e.g. binary class, CAPRI categories, DockQ, RMSD, FNAT, etc.
  • Flexible definition of both new features and targets
  • 3D grid feature mapping
  • Efficient data storage in HDF5 format
  • Support for both classification and regression (based on PyTorch)

Installation

DeepRank requires Python 3.7 or 3.8 on Linux or macOS. Make sure that mpi4py is installed in your environment before installing deeprank: conda install mpi4py

Stable Release

DeepRank is available in stable releases on PyPI:

  • Install the module: pip install deeprank

Development Version

You can also install the source code under development from the development branch:

  • Clone the repository: git clone --branch development https://github.com/DeepRank/deeprank.git
  • Go to the repository directory: cd deeprank
  • Install the package: pip install -e ./

To check whether the installation was successful, you can run the tests:

  • Go into the test directory: cd test
  • Run the test suite: pytest

Tutorial

We give here a tutorial-like introduction to the DeepRank machinery. More information can be found in the documentation http://deeprank.readthedocs.io/en/latest/. We quickly illustrate here the two main steps of DeepRank:

  • the generation of the data
  • running deep learning experiments.

A. Generate the data set (using MPI)

Generating the data requires only the PDB files of the decoys and their native structures, plus the PSSMs if needed. All the features and targets, as well as the features mapped onto grid points, will be automatically calculated and stored in an HDF5 file.

from deeprank.generate import *
from mpi4py import MPI

comm = MPI.COMM_WORLD

# let's put this sample script in the test folder, so the working path will be deeprank/test/
# name of the hdf5 to generate
h5file = './hdf5/1ak4.hdf5'

# for each hdf5 file where to find the pdbs
pdb_source = ['./1AK4/decoys/']


# where to find the native conformations
# pdb_native is only used to calculate i-RMSD, dockQ and so on.
# The native pdb files will not be saved in the hdf5 file
pdb_native = ['./1AK4/native/']


# where to find the pssm
pssm_source = './1AK4/pssm_new/'


# initialize the database
database = DataGenerator(
    chain1='C', chain2='D',
    pdb_source=pdb_source,
    pdb_native=pdb_native,
    pssm_source=pssm_source,
    data_augmentation=0,
    compute_targets=[
        'deeprank.targets.dockQ',
        'deeprank.targets.binary_class'],
    compute_features=[
        'deeprank.features.AtomicFeature',
        'deeprank.features.FullPSSM',
        'deeprank.features.PSSM_IC',
        'deeprank.features.BSA',
        'deeprank.features.ResidueDensity'],
    hdf5=h5file,
    mpi_comm=comm)


# create the database
# compute features/targets for all complexes
database.create_database(prog_bar=True)


# define the 3D grid
grid_info = {
    'number_of_points': [30, 30, 30],
    'resolution': [1., 1., 1.],
    'atomic_densities': {'C': 1.7, 'N': 1.55, 'O': 1.52, 'S': 1.8},
}

# Map the features
database.map_features(grid_info, try_sparse=True, time=False, prog_bar=True)

This script can be executed, for example, with 4 MPI processes using the command:

    NP=4
    mpiexec -n $NP python generate.py
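
As an illustration of what running with several MPI processes means for the workload, here is a minimal sketch of how a list of decoys could be divided among ranks. It assumes a simple round-robin split; the helper split_for_rank and the file names are hypothetical, and DeepRank's own distribution logic may differ. The sketch avoids mpi4py so that it runs anywhere; in a real script the rank and size would come from comm.Get_rank() and comm.Get_size().

```python
def split_for_rank(items, rank, size):
    """Return the subset of items handled by one rank (round-robin)."""
    if size < 1 or not 0 <= rank < size:
        raise ValueError("rank must satisfy 0 <= rank < size")
    return items[rank::size]


# Hypothetical decoy file names, for illustration only
decoys = [f"1AK4_{i}w.pdb" for i in range(1, 11)]
NP = 4  # number of processes, as in the mpiexec command above

chunks = [split_for_rank(decoys, rank, NP) for rank in range(NP)]
for rank, chunk in enumerate(chunks):
    print(rank, chunk)
```

Every decoy is assigned to exactly one rank, and the chunk sizes differ by at most one.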

In the first part of the script we define the paths to the PDBs of the decoys and natives that we want to have in the dataset. All the .pdb files present in pdb_source will be used in the dataset. We need to specify where to find the native conformations to be able to compute the RMSD and the DockQ score. For each PDB file detected in pdb_source, the code will try to find a native conformation in pdb_native.

We then initialize the DataGenerator object. This object (defined in deeprank/generate/DataGenerator.py) needs a few input parameters:

  • pdb_source: where to find the pdb to include in the dataset
  • pdb_native: where to find the corresponding native conformations
  • compute_targets: list of modules used to compute the targets
  • compute_features: list of modules used to compute the features
  • hdf5: Name of the HDF5 file to store the data set

We then create the database with the command database.create_database(). This function automatically creates an HDF5 file in which each PDB has its own group. In each group we can find the PDB of the complex and its native form, the calculated features and the calculated targets. We can now map the features to a grid. This is done via the command database.map_features(). As you can see, this method requires a dictionary as input. The dictionary contains the instructions for mapping the data:

  • number_of_points: the number of points in each direction
  • resolution: the resolution in ångströms
  • atomic_densities: {'atom_name': vdw_radius} the atomic densities required

The atomic densities are mapped following the protein-ligand paper. The other features are mapped to the grid points using a Gaussian function (other modes are possible but are currently hard-coded).
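
To make the grid definition and the Gaussian mapping more concrete, here is a minimal pure-Python sketch. It assumes a grid centred on a given point and an isotropic Gaussian of width sigma; the function names, the centring convention and the value of sigma are assumptions made for illustration and are not taken from the DeepRank code.

```python
import math


def grid_axis(center, n_points, resolution):
    """Coordinates of n_points grid points spaced by resolution, centred on center."""
    half = (n_points - 1) * resolution / 2.0
    return [center - half + i * resolution for i in range(n_points)]


def map_gaussian(value, position, axes, sigma=1.0):
    """Spread one point feature over the grid with an isotropic Gaussian."""
    xs, ys, zs = axes
    px, py, pz = position
    return [[[value * math.exp(-((x - px) ** 2 + (y - py) ** 2 + (z - pz) ** 2)
                               / (2.0 * sigma ** 2))
              for z in zs] for y in ys] for x in xs]


# A small grid keeps the example fast; the tutorial uses 30 points per direction
grid_info = {'number_of_points': [5, 5, 5], 'resolution': [1., 1., 1.]}
axes = [grid_axis(0.0, n, r)
        for n, r in zip(grid_info['number_of_points'], grid_info['resolution'])]
grid = map_gaussian(value=2.0, position=(0.0, 0.0, 0.0), axes=axes)

print(axes[0])        # [-2.0, -1.0, 0.0, 1.0, 2.0]
print(grid[2][2][2])  # 2.0, the peak sits on the feature position
```

The value decays with the distance between the grid point and the feature position, so neighbouring grid points receive a fraction of the feature value.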

Visualization of the mapped features

To explore the HDF5 file and visualize the features you can use the dedicated browser https://github.com/DeepRank/DeepXplorer. This tool allows you to dig through the HDF5 file and to directly generate the files required to visualize the features in VMD or PyMol. An IPython console is also embedded to analyze the feature values, plot them, etc.

B. Deep Learning

The HDF5 files generated above can be used as input for deep learning experiments. You can take a look at the file test/test_learn.py for some examples. We give here a quick overview of the process.

from deeprank.learn import *
from deeprank.learn.model3d import cnn_reg
import torch.optim as optim
import numpy as np

# input database
database = '1ak4.hdf5'

# output directory
out = './my_DL_test/'

# declare the dataset instance
data_set = DataSet(database,
            chain1='C',
            chain2='D',
            grid_info={
                'number_of_points': (10, 10, 10),
                'resolution': (3, 3, 3)},
            select_feature={
                'AtomicDensities': {'C': 1.7, 'N': 1.55, 'O': 1.52, 'S': 1.8},
                'Features': ['coulomb', 'vdwaals', 'charge', 'PSSM_*']},
            select_target='DOCKQ',
            normalize_features = True, normalize_targets=True,
            pair_chain_feature=np.add,
            dict_filter={'DOCKQ':'<1'})


# create the network
model = NeuralNet(data_set, cnn_reg, model_type='3d', task='reg',
                  cuda=False, plot=True, outdir=out)

# change the optimizer (optional)
model.optimizer = optim.SGD(model.net.parameters(),
                            lr=0.001,
                            momentum=0.9,
                            weight_decay=0.005)

# start the training
model.train(nepoch=50, divide_trainset=0.8, train_batch_size=5, num_workers=0)

In the first part of the script we create a Torch database from the HDF5 file. We can specify one or several HDF5 files and even select some conformations using the dict_filter argument. Other options of DataSet can be used to specify the features, the targets, the normalization, etc.
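
To illustrate how a filter such as dict_filter={'DOCKQ': '<1'} can select conformations, here is a minimal sketch. It assumes that each condition is a comparison operator followed by a number; parse_condition, select and the example target values are hypothetical and do not reproduce the DataSet implementation.

```python
import operator
import re

_OPS = {'<=': operator.le, '>=': operator.ge, '==': operator.eq,
        '<': operator.lt, '>': operator.gt}


def parse_condition(condition):
    """Turn a string such as '<1' into a predicate on a number."""
    match = re.fullmatch(r'\s*(<=|>=|==|<|>)\s*(-?\d+(?:\.\d+)?)\s*', condition)
    if match is None:
        raise ValueError(f"unsupported condition: {condition!r}")
    op, threshold = _OPS[match.group(1)], float(match.group(2))
    return lambda value: op(value, threshold)


def select(targets, dict_filter):
    """Keep the conformations whose targets satisfy every condition."""
    checks = {name: parse_condition(cond) for name, cond in dict_filter.items()}
    return [mol for mol, values in targets.items()
            if all(check(values[name]) for name, check in checks.items())]


# Hypothetical target values for three conformations
targets = {'decoy_1': {'DOCKQ': 0.12},
           'decoy_2': {'DOCKQ': 0.85},
           'native': {'DOCKQ': 1.0}}
selected = select(targets, {'DOCKQ': '<1'})
print(selected)  # ['decoy_1', 'decoy_2']
```

With the filter '<1', the conformation whose DockQ score equals 1 is excluded from the data set.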

We then create a NeuralNet instance that takes the dataset as input argument. Several options are available to specify the task to perform, whether to use the GPU, etc. We then simply have to train the model.

Issues and Contributing

If you have questions or find a bug, please report the issue in the Github issue channel.

If you want to change or further develop DeepRank code, please check the Developer Guideline to see how to conduct further development.

iscore's People

Contributors

abelsiqueira, cunlianggeng, dependabot-preview[bot], lilysnow, nicorenaud


iscore's Issues

how to uninstall iscore

I installed iscore in one location. Then I deleted it and wanted to install it in another location. But I got the following error message when running pip install -e ./:

(base) [lixue@alcazar iScore]$ pip install -e ./
Obtaining file:///home/lixue/tools/iScore
Requirement already satisfied: numpy>=1.13 in /home/lixue/tools/anaconda3/lib/python3.6/site-packages (from iScore==0.0) (1.16.2)
Requirement already satisfied: scipy in /home/lixue/tools/anaconda3/lib/python3.6/site-packages (from iScore==0.0) (1.2.1)
Installing collected packages: iScore
Found existing installation: iScore 0.0
Exception:
Traceback (most recent call last):
File "/home/lixue/tools/anaconda3/lib/python3.6/site-packages/pip/_internal/cli/base_command.py", line 179, in main
status = self.run(options, args)
File "/home/lixue/tools/anaconda3/lib/python3.6/site-packages/pip/_internal/commands/install.py", line 393, in run
use_user_site=options.use_user_site,
File "/home/lixue/tools/anaconda3/lib/python3.6/site-packages/pip/_internal/req/init.py", line 50, in install_given_reqs
auto_confirm=True
File "/home/lixue/tools/anaconda3/lib/python3.6/site-packages/pip/_internal/req/req_install.py", line 816, in uninstall
uninstalled_pathset = UninstallPathSet.from_dist(dist)
File "/home/lixue/tools/anaconda3/lib/python3.6/site-packages/pip/_internal/req/req_uninstall.py", line 505, in from_dist
'(at %s)' % (link_pointer, dist.project_name, dist.location)
AssertionError: Egg-link /data/lixue/capri/Capri47/T160/test_iscore/iScore does not match installed location of iScore (at /home/lixue/tools/iScore)
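
The assertion indicates that an egg-link left over from the first development install still records the old location. A minimal sketch of how such leftovers could be detected is shown below. It assumes the layout used by development installs, where site-packages contains a file named after the project with the extension .egg-link whose first line is the project path; find_stale_egg_links and the example paths are hypothetical.

```python
from pathlib import Path
import tempfile


def find_stale_egg_links(site_packages, expected_location):
    """List .egg-link files whose recorded path differs from expected_location."""
    stale = []
    for link in sorted(Path(site_packages).glob('*.egg-link')):
        lines = link.read_text().splitlines()
        recorded = lines[0].strip() if lines else ''
        if Path(recorded) != Path(expected_location):
            stale.append((link.name, recorded))
    return stale


# Self-contained demonstration in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / 'iScore.egg-link').write_text('/old/location/iScore\n.')
    stale = find_stale_egg_links(tmp, '/new/location/iScore')
    print(stale)
```

A stale entry reported this way can then be removed by hand before reinstalling from the new location.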

error when some pdb files cannot have graph generated

When a graph cannot be generated for some PDB files, the error below is raised. In this situation, all the PDB files have a HADDOCK energy (Energy.dat), but some models are missing from the GraphRank file (GraphRank.dat).

Traceback (most recent call last):
File "/home/lixue/tools/iScore/bin/iScore.predict", line 63, i
iscore()
File "/home/lixue/tools/iScore/iScore/function/score.py", line
self.read_graphrank()
File "/home/lixue/tools/iScore/iScore/function/score.py", line
data.append(self.features[m]['grank'])
KeyError: 'grank'

This bug can be fixed by switching the order of the following two lines (quick and dirty):

From

          self.read_graphrank()
          self.read_energy()

to

          self.read_energy()
          self.read_graphrank()
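
As an alternative to relying on the call order, the lookup itself could tolerate models without a GraphRank value. The sketch below assumes that the features are stored as a dictionary of dictionaries keyed by model name, as the traceback suggests; collect_feature and the example values are hypothetical.

```python
def collect_feature(features, name):
    """Split models into those that have the feature and those that lack it."""
    values, missing = {}, []
    for model, feats in features.items():
        if name in feats:
            values[model] = feats[name]
        else:
            missing.append(model)
    return values, missing


features = {
    'model_1': {'energy': -120.5, 'grank': 0.8},
    'model_2': {'energy': -98.1},  # no graph could be generated
}
values, missing = collect_feature(features, 'grank')
print(values)   # {'model_1': 0.8}
print(missing)  # ['model_2']
```

The models listed in missing can then be reported to the user instead of raising a KeyError.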

UnpicklingError: could not find MARK

Describe the bug
UnpicklingError: could not find MARK occurs when attempting to run iScore.predict using the trained model.

Environment:

  • OS system: Ubuntu
  • Version: 20.04
  • Branch commit ID:
  • Inputs:

To Reproduce
Steps/commands to reproduce the behaviour:

  1. created pdb and pssm folders under iScore/test directory
  2. typed either iScore.predict or mpiexec -n iScore.predict.mpi within the test directory (both gave the same error)

Expected Results
Expected the iScore.predict.txt file to be created.

Actual Results or Error Info
nick@nick-HP-ZBook-15-G5:~/iScore/test$ iScore.predict
/home/nick/.local/lib/python3.8/site-packages/skcuda/cublas.py:284: UserWarning: creating CUBLAS context to get version number
warnings.warn('creating CUBLAS context to get version number')
Reusing graphs in ./graph/
Reusing kernels in ./kernel/
Traceback (most recent call last):
File "/home/nick/.local/bin/iScore.predict", line 7, in
exec(compile(f.read(), file, 'exec'))
File "/home/nick/iScore/bin/iScore.predict", line 68, in
iscore_svm(load_model=None,package_name=args.archive,maxlen=args.maxlen,testID=args.ground_truth)
File "/home/nick/iScore/iScore/graphrank/rank.py", line 400, in iscore_svm
testdata = DataSet(trainID,Kfile,maxlen,testID=testID)
File "/home/nick/iScore/iScore/graphrank/rank.py", line 39, in init
self.get_K_matrix()
File "/home/nick/iScore/iScore/graphrank/rank.py", line 104, in get_K_matrix
K.update(pickle.load(open(f,'rb')))
_pickle.UnpicklingError: could not find MARK

Additional Context
Example of pdb and pssm files attached.
1084_TS029_1o.zip
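
Since the run above reuses the kernels found in ./kernel/, a corrupted or incomplete kernel file is one possible cause. Here is a minimal sketch of a loader that reports which file cannot be unpickled instead of stopping at the first failure. The .pckl extension, the helper load_kernels and the assumption that each file contains a pickled dictionary are not taken from the iScore code.

```python
import pickle
import tempfile
from pathlib import Path


def load_kernels(kernel_dir):
    """Merge all pickled kernel dicts in kernel_dir; report unreadable files."""
    merged, bad = {}, []
    for path in sorted(Path(kernel_dir).glob('*.pckl')):
        try:
            with open(path, 'rb') as handle:
                merged.update(pickle.load(handle))
        except (pickle.UnpicklingError, EOFError) as err:
            bad.append((path.name, str(err)))
    return merged, bad


# Self-contained demonstration with one valid and one corrupted file
with tempfile.TemporaryDirectory() as tmp:
    with open(Path(tmp) / 'good.pckl', 'wb') as handle:
        pickle.dump({('a', 'b'): 1.0}, handle)
    (Path(tmp) / 'broken.pckl').write_bytes(b'not a pickle')
    kernels, bad = load_kernels(tmp)

print(sorted(kernels))
print([name for name, _ in bad])
```

Deleting the files reported as unreadable and recomputing the kernels would then be the next thing to try.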

h5x support does not work on windows

Describe the bug
On Windows, the paths of all the PDBs generated in loadData.py should contain double backslashes (C:\\Users\\Name\\ .....) instead of single ones. Also, the paths in example/h5x/graphs.hdf5 are local to my own machine ...

Environment:

  • OS system: Windows
  • Version: latest
  • Branch commit ID: master

To Reproduce
Steps/commands to reproduce the behaviour:

  1. create the hdf5 graph with example/kernel/create_kernel.py
  2. launch h5x : bin/iScore.h5x
  3. right click on mol and select PyMol

Expected Results
A nice PyMol viz of the interface

Actual Results or Error Info
Unicode error due to the single \ when loading the pdbs
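
One way to avoid escaping problems is to write the paths with forward slashes, which Windows generally accepts. The sketch below uses pathlib for the conversion; pymol_safe_path and the example path are hypothetical, and whether loadData.py can adopt this directly has not been checked.

```python
from pathlib import PureWindowsPath


def pymol_safe_path(path):
    """Return a Windows path with forward slashes, safe to embed in a script."""
    return PureWindowsPath(path).as_posix()


# Raw string: in a normal string literal, "\U" would start a Unicode escape
raw = r'C:\Users\Name\iScore\example\mol.pdb'  # hypothetical location
safe = pymol_safe_path(raw)
print(safe)            # C:/Users/Name/iScore/example/mol.pdb
print(f'load {safe}')  # line that could be written to a generated script
```

PureWindowsPath works on any operating system, so the conversion can be tested without a Windows machine.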


the pssm class hard-coded the file structure

The pssm class requires the following folder structure:

--working_dir
   |
    --caseID
        |
         -- pdb
             |
             --- pdb files

We can either write this requirement in the doc or change the code to be more flexible.
Which option should we go for?
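
Whichever option is chosen, the expected layout could be checked up front so that the user gets a clear message. Here is a minimal sketch, assuming the structure shown above (working_dir/caseID/pdb/ containing .pdb files); check_layout and the case names are hypothetical.

```python
from pathlib import Path
import tempfile


def check_layout(working_dir):
    """Return the case directories that lack a non-empty 'pdb' sub-folder."""
    problems = []
    for case in sorted(p for p in Path(working_dir).iterdir() if p.is_dir()):
        pdb_dir = case / 'pdb'
        if not pdb_dir.is_dir() or not any(pdb_dir.glob('*.pdb')):
            problems.append(case.name)
    return problems


# Self-contained demonstration in a temporary working directory
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / '1ACB' / 'pdb'
    good.mkdir(parents=True)
    (good / '1ACB_1w.pdb').write_text('END\n')
    (Path(tmp) / '1AK4').mkdir()  # case without a pdb folder
    problems = check_layout(tmp)
    print(problems)  # ['1AK4']
```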

fail to repeat 684 dataset result

Describe the bug
When I run iScore.predict on one of the data sets, such as 1ACB, I get iScores that differ from those provided in 684. In 684 the scores are in the file 1ACB.iScore; the output of iScore.predict is in the file iScorePredict.dat.

Environment:

  • OS system: Linux VM-88-194-centos 3.10.107-1-tlinux2_kvm_guest-0049
  • Version:
  • Branch commit ID: latest
  • Inputs: the same pdb and pssm in 684

To Reproduce
Steps/commands to reproduce the behaviour:

  1. iScore.predict

Expected Results
Ideally, I want to get the same result files which were provided in 684 datasets including haddock energy scores.

About the outputs

Hi, thanks for the great work! I am wondering which results are the final iScore. Is it the last column in ScorePredict.dat? Many thanks :)

provide training_set.tar.gz

I could not find the training_set.tar.gz from BM5 dimers in our iscore repository. Could you please tell me where it is? Most users do not want to retrain iscore and simply want to use it

To update docs for code v0.3.4

Docs should be updated to be consistent with the code of version 0.3.4.

  • update the tutorial
  • update how to use commands and python modules for generating graph and kernel, training and test.
  • organise the code documentation

issues of iscore doc

I am following iScore workflow and got the error message below:

(base) [lixue@alcazar test]$ mpiexec -n 2 iScore.predict.mpi --archive ../train/training_set.tar.gz
Warning : pycuda not found
Warning : scikit-cuda not found
Warning : pycuda not found
Warning : scikit-cuda not found
Traceback (most recent call last):
File "/home/lixue/tools/iScore/bin/iScore.predict.mpi", line 81, in
iscore_svm(load_model=None,package_name=args.archive,testID=args.ground_truth)
File "/home/lixue/tools/iScore/iScore/graphrank/rank.py", line 401, in iscore_svm
testdata = DataSet(trainID,Kfile,maxlen,testID=testID)
File "/home/lixue/tools/iScore/iScore/graphrank/rank.py", line 39, in init
self.get_K_matrix()
File "/home/lixue/tools/iScore/iScore/graphrank/rank.py", line 112, in get_K_matrix
key = list(K.keys())[1]
IndexError: list index out of range

Could we make sure that GraphRank score is the lower the better

The GraphRank.dat file currently reports: #Name label pred decision_value

Could we report: #Name label pred decision_value GraphRank_score, where we make sure the GraphRank_score is the lower the better. In the current paper, we use “native is labeled as class 1 and non-native 0, positive decision_value is for class 1”.
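
A minimal sketch of the requested column is given below. It assumes that the lower-is-better score can simply be the negated decision value, which follows from the convention quoted above (a positive decision_value indicates class 1, the native class); add_graphrank_score and the example rows are hypothetical.

```python
def add_graphrank_score(lines):
    """Append a lower-is-better column computed as the negated decision value."""
    out = []
    for line in lines:
        if line.startswith('#'):
            out.append(line.rstrip() + ' GraphRank_score')
            continue
        fields = line.split()
        decision_value = float(fields[3])  # 4th column: decision_value
        out.append(f"{line.rstrip()} {-decision_value:.3f}")
    return out


example = ['#Name label pred decision_value',
           'model_1 1 1 0.750',
           'model_2 0 0 -1.200']
result = add_graphrank_score(example)
for row in result:
    print(row)
```

With this convention the model predicted as native (model_1) gets the lower score.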

install blast and its db in setup.py

There are 4 steps in total:

  1. download psiblast
  2. download the blast nr database (we can provide it so that the results are reproducible)
  3. run psiblast to generate the pssm
  4. reformat the pssm to a format that iscore uses
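
The step that runs psiblast could be wrapped as sketched below. The sketch only builds the command line and does not run it; the file names and the number of iterations are hypothetical, and the parameter values that iScore should use are not specified here.

```python
import shlex


def psiblast_command(fasta, db, pssm_out, iterations=3):
    """Build (but do not run) a psiblast call that writes an ASCII PSSM."""
    return ['psiblast',
            '-query', fasta,
            '-db', db,
            '-num_iterations', str(iterations),
            '-out_ascii_pssm', pssm_out]


cmd = psiblast_command('1AK4.C.fasta', 'nr', '1AK4.C.pssm')
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run once psiblast and the database are installed.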

need the PSSM class to allow input sequence

The current PSSM class takes a PDB file as input and extracts the sequences from it. However, many PDB files do not contain the complete sequence (missing segments), and this affects the quality of the PSSM.

Can we make the PSSM class accept user-supplied sequences?
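
To show how missing segments could be detected before deciding whether a user-supplied sequence is needed, here is a minimal sketch. It assumes that the residue numbers of one chain have already been read from the PDB file and that a jump in the numbering indicates a missing segment; find_gaps and the example numbering are hypothetical.

```python
def find_gaps(residue_numbers):
    """Return (last_before, first_after) pairs where numbering is not consecutive."""
    gaps = []
    ordered = sorted(set(residue_numbers))
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous > 1:
            gaps.append((previous, current))
    return gaps


numbers = [1, 2, 3, 4, 9, 10, 11, 20]  # hypothetical residue numbering
gaps = find_gaps(numbers)
print(gaps)  # [(4, 9), (11, 20)]
```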

a bug with "mpiexec -n ${NPROC} iScore.predict"

When I ran "mpiexec -n ${NPROC} iScore.predict" on CAPRI target 161, I ran into a bug.

  1. When I run this command for the first time, I get the following error message:

Reusing kernels in ./kernel/
Traceback (most recent call last):
File "/home/lixue/tools/iScore/bin/iScore.predict", line 57, in
iscore_svm(load_model=None,package_name=args.archive,testID=args.ground_truth)
File "/home/lixue/tools/iScore/iScore/graphrank/rank.py", line 402, in iscore_svm
testdata = DataSet(trainID,Kfile,maxlen,testID=testID)
File "/home/lixue/tools/iScore/iScore/graphrank/rank.py", line 29, in init
self.test_name, self.test_class = self._get_ids(testID)
File "/home/lixue/tools/iScore/iScore/graphrank/rank.py", line 80, in _get_ids
nc = len(idlist[0])
IndexError: list index out of range

The 'graph' folder is generated successfully but the 'kernel' folder is not.

  2. I then deleted the 'kernel' folder and reran the same command. This time the kernel folder is generated correctly, but I got the following error message:

Traceback (most recent call last):
File "/home/lixue/tools/iScore/bin/iScore.predict", line 63, in
iscore()
File "/home/lixue/tools/iScore/iScore/function/score.py", line 17, in init
self.read_graphrank()
File "/home/lixue/tools/iScore/iScore/function/score.py", line 56, in read_graphran
data.append(self.features[m]['grank'])
KeyError: 'grank'

  3. Then I ran the same command a 3rd time, and iScorePredict.dat was finally generated.

The data is on alcazar: /home/lixue/CAPRI/Capri48/t161/iscore/before_cleaning/results_li
