
deeprank / deeprank


This repository has been integrated in https://github.com/DeepRank/deeprank2

License: Apache License 2.0

Python 98.75% C 0.35% Shell 0.08% Makefile 0.18% R 0.64%
3d-cnn docking protein-protein-interaction pytorch

deeprank's Introduction

⚠️ Archiving Note

This repository is no longer being maintained and has been archived for historical purposes.

We have now developed DeepRank2, an improved and unified version of DeepRank, DeepRank-GNN, and DeepRank-Mut.

✨ DeepRank2 allows for transformation and storage of 3D representations of both protein-protein interfaces (PPIs) and protein single-residue variants (SRVs) into either graphs or volumetric grids containing structural and physico-chemical information. These can be used for training neural networks for a variety of patterns of interest, using either our pre-implemented training pipeline for graph neural networks (GNNs) or convolutional neural networks (CNNs) or external pipelines.

We look forward to seeing you in our new space - DeepRank2!

DeepRank


Overview


DeepRank is a general, configurable deep learning framework for data mining protein-protein interactions (PPIs) using 3D convolutional neural networks (CNNs).

DeepRank contains useful APIs for pre-processing PPI data, computing features and targets, as well as training and testing CNN models.

Features:

  • Predefined atom-level and residue-level PPI feature types
    • e.g. atomic density, vdw energy, residue contacts, PSSM, etc.
  • Predefined target types
    • e.g. binary class, CAPRI categories, DockQ, RMSD, FNAT, etc.
  • Flexible definition of both new features and targets
  • 3D grid feature mapping
  • Efficient data storage in HDF5 format
  • Support for both classification and regression (based on PyTorch)

Installation

DeepRank requires Python 3.7 or 3.8 on Linux or macOS. Make sure that mpi4py is installed in your environment before installing deeprank: conda install mpi4py

Stable Release

DeepRank is available in stable releases on PyPI:

  • Install the module pip install deeprank

Development Version

You can also install the development version of the source code from the development branch:

  • Clone the repository git clone --branch development https://github.com/DeepRank/deeprank.git
  • Go there cd deeprank
  • Install the package pip install -e ./

To check whether the installation was successful, you can run the test suite:

  • Go into the test directory cd test
  • Run the test suite pytest

Tutorial

This tutorial gives a short introduction to the DeepRank machinery. More information can be found in the documentation: http://deeprank.readthedocs.io/en/latest/. We briefly illustrate here the two main steps of DeepRank:

  • the generation of the data
  • running deep learning experiments.

A. Generate the data set (using MPI)

Generating the data requires only the PDB files of the decoys and of their native structures, plus the PSSMs if needed. All the features and targets, as well as the features mapped onto grid points, are automatically calculated and stored in an HDF5 file.

from deeprank.generate import *
from mpi4py import MPI

comm = MPI.COMM_WORLD

# let's put this sample script in the test folder, so the working path will be deeprank/test/
# name of the hdf5 to generate
h5file = './hdf5/1ak4.hdf5'

# for each hdf5 file where to find the pdbs
pdb_source = ['./1AK4/decoys/']


# where to find the native conformations
# pdb_native is only used to calculate i-RMSD, dockQ and so on.
# The native pdb files will not be saved in the hdf5 file
pdb_native = ['./1AK4/native/']


# where to find the pssm
pssm_source = './1AK4/pssm_new/'


# initialize the database
database = DataGenerator(
    chain1='C', chain2='D',
    pdb_source=pdb_source,
    pdb_native=pdb_native,
    pssm_source=pssm_source,
    data_augmentation=0,
    compute_targets=[
        'deeprank.targets.dockQ',
        'deeprank.targets.binary_class'],
    compute_features=[
        'deeprank.features.AtomicFeature',
        'deeprank.features.FullPSSM',
        'deeprank.features.PSSM_IC',
        'deeprank.features.BSA',
        'deeprank.features.ResidueDensity'],
    hdf5=h5file,
    mpi_comm=comm)


# create the database
# compute features/targets for all complexes
database.create_database(prog_bar=True)


# define the 3D grid
grid_info = {
    'number_of_points': [30, 30, 30],
    'resolution': [1., 1., 1.],
    'atomic_densities': {'C': 1.7, 'N': 1.55, 'O': 1.52, 'S': 1.8},
}

# Map the features
database.map_features(grid_info,try_sparse=True, time=False, prog_bar=True)

This script can be executed using, for example, 4 MPI processes with the command:

    NP=4
    mpiexec -n $NP python generate.py

In the first part of the script we define the path where to find the PDBs of the decoys and natives that we want to have in the dataset. All the .pdb files present in pdb_source will be used in the dataset. We need to specify where to find the native conformations to be able to compute RMSD and the dockQ score. For each pdb file detected in pdb_source, the code will try to find a native conformation in pdb_native.

We then initialize the DataGenerator object. This object (defined in deeprank/generate/DataGenerator.py) needs a few input parameters:

  • pdb_source: where to find the pdb to include in the dataset
  • pdb_native: where to find the corresponding native conformations
  • compute_targets: list of modules used to compute the targets
  • compute_features: list of modules used to compute the features
  • hdf5: Name of the HDF5 file to store the data set

We then create the database with the command database.create_database(). This function automatically creates an HDF5 file in which each pdb has its own group. In each group we can find the pdb of the complex and its native form, the calculated features and the calculated targets. We can now map the features to a grid. This is done via the command database.map_features(). As you can see, this method requires a dictionary as input. The dictionary contains the instructions for mapping the data:

  • number_of_points: the number of points in each direction
  • resolution: the grid resolution in Å (angstroms)
  • atomic_densities: {'atom_name': vdw_radius} the atomic densities required

The atomic densities are mapped following the protein-ligand paper. The other features are mapped to the grid points using a Gaussian function (other modes are possible but are currently hard-coded).
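As a rough illustration of the grid settings above, the sketch below builds one axis of a grid from number_of_points and resolution and spreads a single feature value onto it with a Gaussian. The centring on the origin, the helper names grid_axis and gaussian_map, and the width sigma are assumptions made for this example; they are not taken from DeepRank's own mapping code.

```python
import math


def grid_axis(number_of_points, resolution, center=0.0):
    """Return evenly spaced coordinates (in angstroms) centred on `center`."""
    half_span = (number_of_points - 1) * resolution / 2.0
    return [center - half_span + i * resolution for i in range(number_of_points)]


def gaussian_map(axis, position, value, sigma=1.0):
    """Spread `value`, located at `position`, onto every point of `axis`."""
    return [value * math.exp(-((x - position) ** 2) / (2.0 * sigma ** 2)) for x in axis]


# 30 points with 1 angstrom spacing, as in the grid_info dictionary of the tutorial
axis = grid_axis(number_of_points=30, resolution=1.0)

# hypothetical feature value of 2.0 carried by an atom sitting at x = 0.5
mapped = gaussian_map(axis, position=0.5, value=2.0)

print(len(axis), axis[0], axis[-1])
print(max(mapped))
```

In three dimensions the same idea would apply to each grid point, with the squared distance between the atom and the point replacing the one-dimensional difference.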

Visualization of the mapped features

To explore the HDF5 file and visualize the features you can use the dedicated browser https://github.com/DeepRank/DeepXplorer. This tool allows you to dig through the hdf5 file and to directly generate the files required to visualize the features in VMD or PyMol. An iPython console is also embedded to analyze the feature values, plot them, etc.

B. Deep Learning

The HDF5 files generated above can be used as input for deep learning experiments. You can take a look at the file test/test_learn.py for some examples. We give here a quick overview of the process.

from deeprank.learn import *
from deeprank.learn.model3d import cnn_reg
import torch.optim as optim
import numpy as np

# input database
database = '1ak4.hdf5'

# output directory
out = './my_DL_test/'

# declare the dataset instance
data_set = DataSet(database,
            chain1='C',
            chain2='D',
            grid_info={
                'number_of_points': (10, 10, 10),
                'resolution': (3, 3, 3)},
            select_feature={
                'AtomicDensities': {'C': 1.7, 'N': 1.55, 'O': 1.52, 'S': 1.8},
                'Features': ['coulomb', 'vdwaals', 'charge', 'PSSM_*']},
            select_target='DOCKQ',
            normalize_features = True, normalize_targets=True,
            pair_chain_feature=np.add,
            dict_filter={'DOCKQ':'<1'})


# create the network
model = NeuralNet(data_set,cnn_reg,model_type='3d',task='reg',
                  cuda=False,plot=True,outdir=out)

# change the optimizer (optional)
model.optimizer = optim.SGD(model.net.parameters(),
                            lr=0.001,
                            momentum=0.9,
                            weight_decay=0.005)

# start the training
model.train(nepoch = 50,divide_trainset=0.8, train_batch_size = 5,num_workers=0)

In the first part of the script we create a Torch database from the HDF5 file. We can specify one or several HDF5 files and even select some conformations using the dict_filter argument. Other options of DataSet can be used to specify the features and targets, the normalization, etc.

We then create a NeuralNet instance that takes the dataset as input argument. Several options are available to specify the task to perform, the GPU usage, etc. We then simply have to train the model.

Issues and Contributing

If you have questions or find a bug, please report the issue in the Github issue channel.

If you want to change or further develop DeepRank code, please check the Developer Guideline to see how to conduct further development.

deeprank's People

Contributors

cunlianggeng, danibodor, dariomarzella, dependabot-preview[bot], francesco03, gcroci2, heleensev, lilysnow, manonreau, nicorenaud, ridderl, sonjageorgievska


deeprank's Issues

normalize_targets=False results in predictions of the same value

If I set "normalize_targets=False", the predictions of deeprank for all test models are the same values (except for epoch_0000):

DeepRank output:
[ 0.26067629 0.26067629 0.26067629 0.26067629 0.26067629 0.26067629
0.26067629 0.26067629 0.26067629 0.26067629 0.26067629 0.26067629
.....]

Here is my code:

path = '/home/deep/projects/deeprank/BM4/'

test_files = ['1E6E.hdf5']
train_database = [ f for f in glob.glob(path + '*.hdf5') if ('1E6E.hdf5' not in f) and ('native' not in f) and ('1
test_database = [path+f for f in test_files]

data_set = DataSet(train_database,
                   test_database=test_database,
                   grid_shape=(30, 30, 30),
                   # select_feature='all',
                   select_feature={'Feature_ind': ['pssm_ic']},
                   pair_chain_feature=np.add,
                   select_target='DOCKQ',
                   # does dict_filter filter only the training data
                   dict_filter={'DOCKQ': '>0.1'},
                   normalize_features=True, normalize_targets=False, tqdm=False)

# create the network
model = NeuralNet(data_set, model3d.cnn, cuda=True, ngpu=1, plot=True)

# change the optimizer (optional)
model.optimizer = optim.SGD(model.net.parameters(),
                            lr=0.001, momentum=0.9, weight_decay=0.005)

# start the training
model.train(nepoch=30, train_batch_size=200, num_workers=8, divide_trainset=[0.8, 0.2])
model.save_model()

Travis is broken !

Get an error message on travis about install of libxml and json schemer .... not sure we need those anymore but it needs fixing !

Add batchnorm

To add batchnorm to modelGenerator.py.

This layer is not really necessary, considering that we can normalize the features beforehand. I am closing this issue with no code changes.

a bug: 2D + pair_chain_feature

The code crashes when I use 2D convolution with pair_chain_feature.

Error message:

Traceback (most recent call last):
File "learn_2d.py", line 35, in
model.train(nepoch = 1,train_batch_size = 2,num_workers=8,divide_trainset=[0.8,0.2])
File "/home/lixue/deeprank/BM4/xper/all/exp001/deeprank/deeprank/learn/NeuralNet.py", line 276, in train
num_workers=num_workers)
File "/home/lixue/deeprank/BM4/xper/all/exp001/deeprank/deeprank/learn/NeuralNet.py", line 456, in _train
self.valid_loss,self.data['valid'] = self._epoch(valid_loader,train_model=False)
File "/home/lixue/deeprank/BM4/xper/all/exp001/deeprank/deeprank/learn/NeuralNet.py", line 537, in _epoch
outputs = self.net(inputs)
File "/home/lixue/softwares/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 224, in call
result = self.forward(*input, **kwargs)
File "/home/lixue/deeprank/BM4/xper/all/exp001/model2d.py", line 55, in forward
x = self._forward_features(x)
File "/home/lixue/deeprank/BM4/xper/all/exp001/model2d.py", line 47, in _forward_features
x = F.relu(self.convlayer2D_000(x))
File "/home/lixue/softwares/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 224, in call
result = self.forward(*input, **kwargs)
File "/home/lixue/softwares/anaconda3/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 254, in forward
self.padding, self.dilation, self.groups)
File "/home/lixue/softwares/anaconda3/lib/python3.6/site-packages/torch/nn/functional.py", line 52, in conv2d
return f(input, weight, bias)
RuntimeError: Need input.size[1] == 30 but got 1 instead.

Part of my code:

data_set = DataSet(train_files,
                   test_database=test_database,
                   grid_shape=(30, 30, 30),
                   # select_feature='all',
                   select_feature={'Feature_ind': ['pssm_ic']},
                   pair_chain_feature=np.add,
                   select_target='DOCKQ',
                   dict_filter={'DOCKQ': '>0.1'},
                   normalize_features=True, normalize_targets=True, tqdm=False)

My code location: /home/lixue/deeprank/BM4/xper/all/exp001/learn_2d.py

add manual random seed for dataset preshuffle

add preshuffle_seed option to NeuralNet.py. By giving a fixed seed value, it will generate same training/validation/test data for different trainings, which is useful for optimising hyperparameters, e.g. batch size.
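A minimal sketch of what a seeded preshuffle could look like is given below. The function name preshuffle_split, its arguments and the 80/20 split are assumptions for illustration only; the actual preshuffle_seed option in NeuralNet.py may be implemented differently.

```python
import random


def preshuffle_split(n_items, train_fraction=0.8, seed=None):
    """Shuffle the indices 0..n_items-1 and split them into train and validation sets.

    With a fixed `seed` the shuffle, and therefore the split, is reproducible.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_train = round(train_fraction * n_items)
    return indices[:n_train], indices[n_train:]


# two independent runs with the same seed see exactly the same split
train_a, valid_a = preshuffle_split(100, seed=2019)
train_b, valid_b = preshuffle_split(100, seed=2019)

print(train_a == train_b, len(train_a), len(valid_a))
```

Using a dedicated random.Random instance keeps the split independent of any other random number consumption in the program, which is convenient when comparing runs that differ only in batch size.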

python 3.7.1 does not work with pytorch 0.4.0

After installing Python 3.7.1, I tried to install pytorch as below:

conda install pytorch=0.4.0 cuda90 -c pytorch

Solving environment: failed
UnsatisfiableError: The following specifications were found to be in conflict:
  - anaconda==2018.12=py37_0 -> mkl-service==1.1.2=py37he904b0f_5
  - anaconda==2018.12=py37_0 -> numexpr==2.6.8=py37h9e4a6bb_0
  - anaconda==2018.12=py37_0 -> scikit-learn==0.20.1=py37hd81dba3_0
  - pytorch=0.4.0
Use "conda info <package>" to see the dependencies for each package.

update to pytorch 1.0

we're still using pytorch 0.4
to benefit from the fancy new developments (new layers, parallelization, .... ) we need pytorch 1.0

two minor issues with DataGenerator

[the following issues are mentioned by Nico. I report here in case we forget.]

When I use DataGenerator to add new target values (binary class) to an existing hdf5, I had the following two problems.

  1. If a hdf5 does not exist, DataGenerator(compute_targets='...') will generate an empty hdf5. No error reported.
  2. After the new target values are added to the hdf5 file, we cannot add it again. The code will throw the following errors:

Traceback (most recent call last):
File "add_binaryClass.py", line 19, in
data_set.add_target()
File "/home/lixue/deeprank/deeprank/deeprank/generate/DataGenerator.py", line 451, in add_target
self._compute_targets(self.compute_targets, molgrp['complex'].value,molgrp['targets'])
File "/home/lixue/deeprank/deeprank/deeprank/generate/DataGenerator.py", line 970, in _compute_targets
targ_module.compute_target(pdb_data,targrp)
File "/home/lixue/deeprank/deeprank/deeprank/targets/binary_class.py", line 21, in compute_target
targrp.create_dataset('BIN_CLASS',data=np.array(classID))
File "/home/lixue/softwares/anaconda3/lib/python3.6/site-packages/h5py/_hl/group.py", line 111, in create_dataset
self[name] = dset
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "/home/lixue/softwares/anaconda3/lib/python3.6/site-packages/h5py/_hl/group.py", line 276, in setitem
h5o.link(obj.id, self.id, name, lcpl=lcpl, lapl=self._lapl)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py/h5o.pyx", line 202, in h5py.h5o.link
RuntimeError: Unable to create link (Name already exists)

My code add_binaryClass.py is here: /home/lixue/deeprank/BM4/test/add_binaryClass.py
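One possible way to make the second call succeed is to delete an existing entry before creating it again. The sketch below is duck-typed: it relies only on the `in`, `del` and create_dataset operations that h5py groups provide. The helper name replace_dataset and the FakeGroup class are hypothetical; FakeGroup is an in-memory stand-in used so that the example runs without an HDF5 file.

```python
def replace_dataset(group, name, data):
    """Create `name` in `group`, deleting an existing entry of that name first."""
    if name in group:
        del group[name]
    return group.create_dataset(name, data=data)


class FakeGroup(dict):
    """Hypothetical stand-in for an HDF5 group, for this demo only."""

    def create_dataset(self, name, data=None):
        # mimic the behaviour reported in the traceback above
        if name in self:
            raise RuntimeError("Unable to create link (name already exists)")
        self[name] = data
        return data


targets = FakeGroup()
replace_dataset(targets, "BIN_CLASS", 1)
replace_dataset(targets, "BIN_CLASS", 0)  # the second call no longer raises

print(targets["BIN_CLASS"])
```

Whether silently overwriting is the desired behaviour is a design choice; skipping targets that already exist, or raising a clearer error, would be equally valid.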

GPU support for the mapping

While we have GPU support for the mapping, two issues remain:

  • the speed-up is quite small (about twice as fast on GPUs compared to CPUs)
  • GPU is not supported for multi-value features

We should revisit the GPU implementation of the feature calculation.

Dataset augmentation on the fly

The augmentation of the data set is currently done statically: we must generate extra conformations in the .hdf5 file by rotating the structure and mapping the features, which leads to very large datasets. One option would be to augment the dataset on the fly during the training of the network. To do that, we would have to rotate the features and map them during the batch preparation. This can be done but might cost a lot of CPU time.
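To give an idea of the rotation step that would have to run during batch preparation, here is a small sketch that rotates a set of feature positions around their centroid. Restricting the rotation to an axis parallel to z, and the names rotate_about_z and xyz, are simplifications assumed for this example; a real augmentation would draw a random axis as well and would be followed by the mapping onto the grid.

```python
import math
import random


def rotate_about_z(points, angle, center):
    """Rotate 3D points by `angle` (radians) around a z-parallel axis through `center`."""
    cx, cy, _ = center
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    rotated = []
    for x, y, z in points:
        dx, dy = x - cx, y - cy
        rotated.append((cx + cos_a * dx - sin_a * dy, cy + sin_a * dx + cos_a * dy, z))
    return rotated


rng = random.Random(0)

# hypothetical positions of three feature points
xyz = [(1.0, 0.0, 0.0), (0.0, 2.0, 1.0), (-1.0, -2.0, 2.0)]
center = tuple(sum(axis) / len(xyz) for axis in zip(*xyz))

# a new random orientation could be drawn for every batch
augmented = rotate_about_z(xyz, rng.uniform(0.0, 2.0 * math.pi), center)
print(augmented)
```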

(semi)Bug: cannot add a feature that is already inside the hdf5 file

When trying to add a feature to a hdf5 file, which already has this feature, we will get the following error message:

======================================================================
ERROR: test_4_add_feature (main.TestGenerateData)
Add a feature to the database.

Traceback (most recent call last):
File "test_generate.py", line 102, in test_4_add_feature
database.add_feature(prog_bar=True)
File "/nfs/home6/lixue1/deeprank/deeprank/generate/DataGenerator.py", line 368, in add_feature
self._compute_features(self.compute_features, molgrp['complex'].value,molgrp['features'],molgrp['features_raw'] )
File "/nfs/home6/lixue1/deeprank/deeprank/generate/DataGenerator.py", line 974, in _compute_features
feat_module.compute_feature(pdb_data,featgrp,featgrp_raw)
File "/nfs/home6/lixue1/deeprank/deeprank/features/FullPSSM.py", line 201, in compute_feature
pssm.export_dataxyz_hdf5(featgrp)
File "/nfs/home6/lixue1/deeprank/deeprank/features/FeatureClass.py", line 97, in export_dataxyz_hdf5
featgrp.create_dataset(name,data=ds)
File "/home/lixue1/tools/anaconda3/lib/python3.7/site-packages/h5py/_hl/group.py", line 119, in create_dataset
self[name] = dset
File "/home/lixue1/tools/anaconda3/lib/python3.7/site-packages/h5py/_hl/group.py", line 287, in setitem
h5o.link(obj.id, self.id, name, lcpl=lcpl, lapl=self._lapl)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py/h5o.pyx", line 202, in h5py.h5o.link
RuntimeError: Unable to create link (name already exists)


Code:

            database = DataGenerator(pdb_source=None,pdb_native=None,data_augmentation=None,
                                     pssm_source='./1AK4/pssm_new/',
                                     compute_features  = ['deeprank.features.FullPSSM'], hdf5=h5)

Testing

A lot of testing is required. A few tests are done with nosetests. We should move to unittest and cover the code much better.

Note : The CUDA and pyTorch parts can't be tested on Travis (no GPU on travis and pytorch fails to import).

Is there any reason why we flatten the output in NeuralNet.py

Is there any reason why we flatten the output? When doing classification, the predictions for multiple classes are also flattened.

Here is the code that I am talking about in NeuralNet.py:
#data['outputs'] = np.array(data['outputs']).flatten()
#data['targets'] = np.array(data['targets']).flatten()

shall we try to use informative variable names?

Shall we try to use informative variable names for better team work, rather than names like k, v, kk, vv below?

  def _export_epoch_hdf5(self, epoch, data):
      """Export the epoch data to the hdf5 file

      Args:
          epoch (int): index of the epoch
          data (dict): data of the epoch
      """
      grp_name = 'epoch_%04d' % epoch
      grp = self.f5.create_group(grp_name)
      grp.attrs['type'] = 'epoch'
      for k, v in data.items():
          try:
              sg = grp.create_group(k)
              for kk, vv in v.items():

[bug] GLY are excluded from contacts in FullPSSM.py

FullPSSM.py extracts the position of the CB atom of each residue with xyz_info = sql.get('chainID,resSeq,resName',name='CB'), but glycine residues do not have a CB atom.

All the GLYs at the interface are “secretly” reported to self.debug:

if tuple(res) in xyz_dict:
    pass  # do something
else:
    printif([tuple(res), ' not found in the pdbfile'],self.debug)
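A common convention, assumed here and not taken from the DeepRank code, is to fall back to the CA atom for glycine. The sketch below applies that rule to a plain list of atom records; the dictionary keys mirror the column names used in the sql.get call above, while the helper names and the coordinates are hypothetical.

```python
def contact_atom_name(res_name):
    """Atom used to locate a residue: CB in general, CA for glycine (which has no CB)."""
    return "CA" if res_name == "GLY" else "CB"


def residue_positions(atoms):
    """Map (chainID, resSeq, resName) to the coordinates of the residue's reference atom."""
    positions = {}
    for atom in atoms:
        if atom["name"] == contact_atom_name(atom["resName"]):
            key = (atom["chainID"], atom["resSeq"], atom["resName"])
            positions[key] = atom["xyz"]
    return positions


# hypothetical atom records for one alanine and one glycine
atoms = [
    {"chainID": "C", "resSeq": 10, "resName": "ALA", "name": "CA", "xyz": (0.0, 0.0, 0.0)},
    {"chainID": "C", "resSeq": 10, "resName": "ALA", "name": "CB", "xyz": (1.5, 0.0, 0.0)},
    {"chainID": "C", "resSeq": 11, "resName": "GLY", "name": "CA", "xyz": (3.8, 0.0, 0.0)},
]

positions = residue_positions(atoms)
print(positions)
```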

Matrix definition of the structure similarity

Sonja proposed to replace the i-rmsd and l-rmsd by matrix-based metrics.
For example, for l-rmsd one can align the long chains and compute the rotation/translation that would overlap the short chains. That would replace the rmsd as a metric for similarity.

Do the rotation on the fly

Could we find a way to only store the raw features for the complexes and augment the dataset dynamically during the learning ?

At the moment we need to specify how many rotations we want to store and they are statically stored in the hdf5. This seems a bit silly.

Feature calculations from PDB

As discussed on Friday, we should work out an efficient way to create all the grids. Thanks to Li, for sharing the code for calculating atom pairwise electrostatics and VDW terms. It looks good, but one concern/question I have is that a lot of intermediate file reading and writing seem to be done, which might not be efficient when starting with a complete new dataset and potentially uses a lot of disc space?

If you agree, I wonder if it would be worthwhile to invest some time in creating a (python) library/class that reads a PDB once and has all the methods to perform all the feature calculations we need and then writes the grids for deeplearning directly. It could also include a method that checks which grids are still missing (for a given model) and only calculates those.

This would make sense if most of the features we need are going to be calculated by our own code. Then it would help to read the PDB only once. If we call many external codes to calculate features and that read a PDB anyway, we may go for a different design.
So probably a good first step is to make an inventory of the features we think we will need (Table 1 from the proposal) + whether we can easily calculate them ourselves or if we need to call external packages for those.

@LilySnow, @NicoRenaud, What do you think?

1E6E.hdf5 has only 219 models

1E6E.hdf5 has only 219 models and 400 models should be there. According to Nico, "when the calculation of the feature or their mapping fails I remove the model from the hdf5 file".

_plot_boxplot_class is missing from NeuralNet.py

When I do classification, I ran into this bug:

Traceback (most recent call last):
File "learn.py", line 33, in
model = NeuralNet(data_set, model3d.cnn,cuda=True,ngpu=1,plot=True, task='class')
File "/home/lixue/deeprank/BM4/xper/all/exp002_2L_Cl/deeprank/learn/NeuralNet.py", line 144, in init
self._plot_scatter = self._plot_boxplot_class
AttributeError: 'NeuralNet' object has no attribute '_plot_boxplot_class'

the development branch is broken on classification

Error message:

Traceback (most recent call last):
File "learn.py", line 44, in
model.train(nepoch = 5, divide_trainset = None, train_batch_size = 5, num_workers=0, save_model='all')
File "/nfs/home6/lixue1/deeprank/deeprank/learn/NeuralNet.py", line 336, in train
save_model=save_model)
File "/nfs/home6/lixue1/deeprank/deeprank/learn/NeuralNet.py", line 596, in _train
self.valid_loss,self.data['valid'] = self._epoch(valid_loader,train_model=False)
File "/nfs/home6/lixue1/deeprank/deeprank/learn/NeuralNet.py", line 763, in _epoch
data['hit'] = self._get_relevance(data)
File "/nfs/home6/lixue1/deeprank/deeprank/learn/NeuralNet.py", line 1084, in _get_relevance
out = F.softmax(torch.FloatTensor(out), dim=1).data.numpy()[:,1]
IndexError: index 1 is out of bounds for axis 1 with size 1
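The IndexError indicates that the array passed to softmax has a single column, so there is no column 1 to select. The sketch below reproduces the selection in plain Python with an explicit shape check; it is an illustration of the failure mode with hypothetical helper names, not the code of _get_relevance, and it makes no claim about why the network produced a single output here.

```python
import math


def softmax(row):
    """Numerically stable softmax of one row of scores."""
    shifted = [value - max(row) for value in row]
    exps = [math.exp(value) for value in shifted]
    total = sum(exps)
    return [value / total for value in exps]


def positive_class_probability(outputs):
    """Return P(class 1) for each sample; `outputs` holds one row of scores per sample."""
    n_columns = len(outputs[0])
    if n_columns < 2:
        raise ValueError(
            f"expected at least 2 output columns for classification, got {n_columns}"
        )
    return [softmax(row)[1] for row in outputs]


# two samples with two class scores each
probs = positive_class_probability([[0.0, 0.0], [1.0, 3.0]])
print(probs)
```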

When two proteins are far from each other, got error msg from ResidueDensity.py

Error message:

Create database : 0%| | 7/25301 [00:09<9:34:43, 1.36s/it, mol=2I25_ranair-itw_387w.pdb]Error during the feature calculation of /dev/shm/2I25/2I25/2I25_ranair-itw_387w.pdb
Traceback (most recent call last):
File "/nfs/home6/lixue1/deeprank/deeprank/generate/DataGenerator.py", line 282, in create_database
molgrp['features_raw'] )
File "/nfs/home6/lixue1/deeprank/deeprank/generate/DataGenerator.py", line 1103, in _compute_features
feat_module.compute_feature(pdb_data,featgrp,featgrp_raw)
File "/nfs/home6/lixue1/deeprank/deeprank/features/ResidueDensity.py", line 160, in compute_feature
resdens.export_data_hdf5(featgrp_raw)
File "/nfs/home6/lixue1/deeprank/deeprank/features/FeatureClass.py", line 64, in export_data_hdf5
ds = np.array(ds).astype('|S'+str(len(ds[0])))
IndexError: list index out of range

Such pdb files should be properly reported back to the user.
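The traceback ends at ds[0] on an empty list, which is what one would expect when no feature line was produced for the model. A guard along the following lines would turn that into an explicit message. The function fixed_width_strings and the sample lines are hypothetical and only mimic the fixed-width conversion seen in export_data_hdf5.

```python
def fixed_width_strings(lines, mol_name):
    """Pad `lines` to a common width, failing with a clear message if there are none."""
    if not lines:
        raise ValueError(
            f"{mol_name}: no feature lines were produced "
            "(are the two chains close enough to form an interface?)"
        )
    width = max(len(line) for line in lines)
    return [line.ljust(width) for line in lines], width


# normal case: a few made-up feature lines
padded, width = fixed_width_strings(["C 10 ALA 1.0", "D 7 GLY 0.25"], "example_model")

# failing case: the model from the report above, with no feature lines
try:
    fixed_width_strings([], "2I25_ranair-itw_387w")
except ValueError as err:
    message = str(err)
    print(message)
```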

Improve the normalization of the features

It is not clear how the normalization is done unless I read the code line by line. Is it done over mini-batch or batch or one case? Could we add more information to it? An equation?

Improve HitRate Plot

My experiment setting: 3D conv, dict_filter = {'DOCKQ':'>0.1'}, feature with pssm_ic only, regression

And I got the following strange hit rate plot:
image

Here is the scatter plot of predicted DockQ vs. real DockQ:
image

My code (also can be found on alembick: /home/lixue/deeprank/BM4/xper/all/exp001/learn.py and model3d.py):

from deeprank.learn import *
import model3d
import torch.optim as optim
import os
import numpy as np
import glob

# declare the dataset instance

path = '/home/deep/projects/deeprank/BM4/'

test_files = ['1E6E.hdf5']
train_database = [f for f in glob.glob(path + '*.hdf5') if '1E6E.hdf5' not in f]
test_database = [path + f for f in test_files]

data_set = DataSet(train_database,
                   test_database=test_database,
                   grid_shape=(30, 30, 30),
                   # select_feature='all',
                   select_feature={'Feature_ind': ['pssm_ic']},
                   pair_chain_feature=np.add,
                   select_target='DOCKQ',
                   # does dict_filter filter only the training data
                   dict_filter={'DOCKQ': '>0.1'},
                   normalize_features=True, normalize_targets=True, tqdm=False)

# create the network

model = NeuralNet(data_set, model3d.cnn, cuda=True, ngpu=1, plot=True)

# change the optimizer (optional)

model.optimizer = optim.SGD(model.net.parameters(),
                            lr=0.001, momentum=0.9, weight_decay=0.005)

# start the training

model.train(nepoch=30, train_batch_size=50, num_workers=8, divide_trainset=[0.8, 0.2])
model.save_model()

The NN model file:

import torch
from torch.autograd import Variable
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.utils.data as data_utils

'''
definition of the Convolutional Networks

The model must take as an argument the input shape
This allows to automatically precompute
the input size of the first FC hidden layer

'''

class cnn(nn.Module):

    def __init__(self, input_shape):
        super(cnn, self).__init__()

        self.conv1 = nn.Conv3d(input_shape[0], 4, kernel_size=2)
        self.pool = nn.MaxPool3d((2, 2, 2))
        self.conv2 = nn.Conv3d(4, 5, kernel_size=2)
        self.conv2_drop = nn.Dropout3d()

        size = self._get_conv_output(input_shape)

        self.fc1 = nn.Linear(size, 84)
        self.fc2 = nn.Linear(84, 1)

        self.sm = nn.Softmax()

    def _get_conv_output(self, shape):
        inp = Variable(torch.rand(1, *shape))
        out = self._forward_features(inp)
        return out.data.view(1, -1).size(1)

    def _forward_features(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        return x

    def forward(self, x):

        x = self._forward_features(x)
        x = x.view(x.size(0), -1)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

initialize the neural network with weights from a pretrained network

The initialization of a nn is too random. So I would like to initialize it with reasonable weights from a pretrained model.

The pretrained models are generated by: torch.save(model.state_dict(), PATH) (recommended by pytorch)

So far our code works with the entire model saved by torch.save(model, PATH)

Log for the generation/mapping of the features

When the generation/mapping of a given conformation fails, the code skips that conformation and removes it after completion. It would be great to track these failures and have a good overview of why they failed.

adding the NaivePSSM feature to the database does not work

Data:
/projects/0/deeprank/PPI4DOCK/test

Code (test_generate.py under the same folder):
database = DataGenerator(pdb_source=None,pdb_native=None,data_augmentation=None, pssm_source='./pssm', compute_features = ['deeprank.features.NaivePSSM'], hdf5=h5)

Error message:

======================================================================
ERROR: test_4_add_feature (main.TestGenerateData)
Add a feature to the database.

Traceback (most recent call last):
File "test_generate.py", line 102, in test_4_add_feature
database.add_feature(prog_bar=True)
File "/nfs/home6/lixue1/deeprank/deeprank/generate/DataGenerator.py", line 368, in add_feature
self._compute_features(self.compute_features, molgrp['complex'].value,molgrp['features'],molgrp['features_raw'] )
File "/nfs/home6/lixue1/deeprank/deeprank/generate/DataGenerator.py", line 974, in _compute_features
feat_module.compute_feature(pdb_data,featgrp,featgrp_raw)
File "/nfs/home6/lixue1/deeprank/deeprank/features/NaivePSSM.py", line 201, in compute_feature
pssm.read_PSSM_data()
File "/nfs/home6/lixue1/deeprank/deeprank/features/NaivePSSM.py", line 83, in read_PSSM_data
raise FileNotFoundError('No PSSM file found for %s in %s',self.mol_name,self.pssm_path)
FileNotFoundError: [Errno No PSSM file found for %s in %s] 2w83-AB_20: '/nfs/home6/lixue1/deeprank/deeprank/features/PSSM/'


According to Nico, the NaivePSSM feature should be removed.
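Separately from the missing file, the error message in the traceback above still contains unfilled %s placeholders. This is standard Python behaviour: OSError subclasses such as FileNotFoundError interpret three positional arguments as (errno, strerror, filename) rather than as a format string and its values. The sketch below shows the difference; the directory name is a hypothetical placeholder.

```python
mol_name = "2w83-AB_20"
pssm_path = "/path/to/PSSM/"  # hypothetical directory

# Passing the values as extra arguments leaves the %s placeholders unfilled,
# because the three arguments are read as (errno, strerror, filename).
unformatted = FileNotFoundError("No PSSM file found for %s in %s", mol_name, pssm_path)

# Formatting the message before raising gives a readable error.
formatted = FileNotFoundError(f"No PSSM file found for {mol_name} in {pssm_path}")

print(str(unformatted))
print(str(formatted))
```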

Improve storage of the epoch data

@NicoRenaud Thanks for adding the model IDs. Could you please remove the path? This info is not needed.

From:
['/home/deep/projects/deeprank/BM4/1KAC.hdf5', '1KAC_101w'],

To:
'1KAC_101w'

How to best normalize the features ?

At the moment each channel is normalized so that the distribution of all the values of each channel for all the conformations are Gaussian distributed with mean 0 and std 1. Is it what we want considering that we have some features that are physically positive and negative ?

For example a conformation that would lead to only positive Coulomb interaction (very bad conformation ) could have pos/neg values after normalization. Same for a conf with only negative coulomb interaction (much better conformation) could have pos/neg values after normalization ..... That doesn't sound right to me
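The effect described above can be reproduced with a few numbers. The sketch assumes the usual standardization x' = (x - mean) / std, with the mean and standard deviation taken over the whole data set, as the description above suggests; all values are made up for illustration.

```python
from statistics import mean, pstdev


def standardize(values, mu, sigma):
    """Shift and scale `values` using data-set-wide statistics."""
    return [(value - mu) / sigma for value in values]


# hypothetical Coulomb values pooled over the whole data set
dataset_values = [-4.0, -2.0, 0.0, 2.0, 4.0, 6.0]
mu, sigma = mean(dataset_values), pstdev(dataset_values)

# one conformation whose Coulomb values are all positive
conformation = [0.5, 2.0, 6.0]
normalized = standardize(conformation, mu, sigma)

print(mu, sigma)
print(normalized)
```

Because the data-set mean is positive here, the smallest positive value of the conformation becomes negative after normalization, which is exactly the loss of physical sign discussed in this issue. Scaling by the standard deviation alone, without subtracting the mean, would preserve the sign; whether that is preferable is left open here.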

Plus doing the statistic over the entire dataset takes for ever

Should we normalize at all ?

Parallelize HDF5 generation

Can we do the same thing we've done for iScore and use mpi4py to parallelize the generation of each hdf5? Each MPI process would only handle a portion of the pdbs. The hdf5 files might have to be stitched back together at the end.
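The division of work could be as simple as a strided slice of the list of pdb files, sketched below in plain Python. In an mpi4py program, rank and size would come from comm.Get_rank() and comm.Get_size(); here they are looped over so that the example runs without MPI. The function name split_for_rank and the file names are hypothetical.

```python
def split_for_rank(items, rank, size):
    """Return the part of `items` that process `rank` (out of `size`) should handle."""
    return items[rank::size]


# hypothetical list of decoy files
pdbs = [f"decoy_{i:03d}.pdb" for i in range(10)]

size = 4  # number of MPI processes
chunks = [split_for_rank(pdbs, rank, size) for rank in range(size)]

for rank, chunk in enumerate(chunks):
    print(rank, chunk)
```

Each process would then write its own partial hdf5 file, and the partial files would be merged once all processes have finished.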

Normalization of added features

When we add new features to the dataset, we currently have to recompute the normalization of all the features. We should be able to add the normalization info to the pickle file.

minor issues about hdf5 files

  1. Maybe we should not call the pdb file of a model as "native" in the BM4 hdf5 files (e.g., 1E6E.hdf5) and call it "pdb" instead:

In [8]: list(f['1E6E_9w'])
Out[8]:
['complex',
'features',
'features_raw',
'grid_points',
'mapped_features',
'native',
'targets']

  2. Shall we also put haddock score in the BM4 hdf5 files for easy comparison with haddock scoring function?

  3. In the final output data.hdf5 file, shall we also store the model IDs (currently it only contains target DockQ and predicted dockQ). The current version of data.hdf5 is not convenient for the comparison with other methods.

some changes of the hit rate plot?

  1. Can we plot hit rate curves from different epochs on one plot?
  2. Could we also plot hit rate up to top N, where N is the number of decoys for this test case?
  3. Can we also plot HADDOCK score on the hit rate plot so that we can compare with haddock? Haddock scores can be extracted from the pdb files of models. I have also extracted them and put them here: @alembick:/home/lixue/DBs/BM4/haddockScore/water
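For reference, a hit rate curve up to top N can be computed in a few lines. The definition used below (the fraction of all hits found among the top-N ranked models) and the function name hit_rate_curve are assumptions of this sketch and may differ from what NeuralNet.py plots. Since a lower HADDOCK score is better, the same function can rank HADDOCK scores by passing higher_is_better=False.

```python
def hit_rate_curve(scores, is_hit, higher_is_better=True):
    """Fraction of all hits found within the top-N ranked models, for N = 1..len(scores)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=higher_is_better)
    total_hits = sum(is_hit)
    if total_hits == 0:
        return [0.0] * len(scores)
    curve, found = [], 0
    for index in order:
        found += is_hit[index]
        curve.append(found / total_hits)
    return curve


# hypothetical predicted DockQ values and hit labels for four decoys
predicted = [0.9, 0.1, 0.7, 0.4]
hits = [1, 0, 0, 1]

curve = hit_rate_curve(predicted, hits)
print(curve)
```

Computing one such curve per epoch, or one per scoring method, gives lists of equal length that can be drawn on a single plot for comparison.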

replace test() with model.eval()

Shall we replace test() in NeuralNet.py with model.eval()? The test() calls one time epoch. But if our model has dropout, the test won't be correct (maybe I am not familiar with the code enough. So I want to double check)?

Improve the data storage in the HDF5 files

The data in the mol.hdf5 file can be improved. We could for example :

  • use soft link for the pdbs (decoys and native). Warning : this would require some tweaking of pdb2sql
  • Change the names of the mapped features (don't need the _ind) and store all them in a single subgroup
    (i.e. do not separate atomic densities from the other features)

a minor bug in NeuralNet.py

When we do classification, the following line of NeuralNet.py:
loss = self.criterion(outputs,targets)
needs to be changed to
loss = self.criterion(outputs,targets.view(-1))

Otherwise, we will get “multi-target not supported” error.

Create a Raw PSSM feature

We want to have the raw PSSM (i.e. the 20 scores for each residue) as a feature.
We can reuse the NaivePSSM feature as a code basis
