nh2tran / deepnovo

Protein Identification with Deep Learning

License: Other

Python 100.00%

deep-learning tensorflow sequencing mass-spectrometry dynamic-programming machine-learning peptide-identification

deepnovo's Introduction

DeepNovo

Latest update: De novo sequencing for both DDA and DIA.

Moving forward, we use a feature-based framework to unify DDA and DIA data analysis. As the data and model structure change, and to keep this DDA repository intact, we will maintain the new framework in a different repository: https://github.com/nh2tran/DeepNovo-DIA.
Publication: Deep learning enables de novo peptide sequencing from DIA mass spectrometry. Nature Methods, 2018. (https://www.nature.com/articles/s41592-018-0260-3)

Protein Identification with Deep Learning: from abc to xyz.

DeepNovo is a deep learning-based tool for protein identification from tandem mass spectrometry data. The core model applies convolutional neural networks and recurrent neural networks to predict the amino acid sequence of a peptide from its spectrum, a task similar to generating a caption from an image. We combine two techniques, de novo sequencing and database search, into a single deep learning framework for peptide identification, and use a de Bruijn graph to assemble peptides into protein sequences.

More details are available in our publications:

Protein identification with deep learning: from abc to xyz. arXiv:1710.02765, 2017.
De novo peptide sequencing by deep learning. Proceedings of the National Academy of Sciences, 2017.
Complete de novo assembly of monoclonal antibody sequences. Scientific Reports, 2016.

If you want to use the models in our PNAS paper, please use the branch PNAS.

Update version 0.0.1

The first-ever hybrid tool for peptide identification that performs de novo sequencing and database search under the same scoring and sequencing framework. DeepNovo now has three sequencing modes: search_denovo(), search_db(), and search_hybrid().
Added decoy database search to estimate False Discovery Rate (FDR). The FDR can be used to filter both database search and de novo sequencing results.
Replaced DecodingModel with ModelInference to make the code that builds the neural network models easier to understand and to develop further.
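
The decoy-based FDR filter mentioned in the list above can be illustrated with a generic target-decoy calculation. This is a minimal sketch, not DeepNovo's own code: the function name `fdr_threshold`, the `(score, is_decoy)` input format, and the decoys/targets estimate are assumptions made for illustration.

```python
def fdr_threshold(matches, max_fdr=0.01):
    """Return the lowest score cutoff whose estimated FDR is <= max_fdr.

    matches: list of (score, is_decoy) pairs; a higher score is a better match.
    The FDR at a cutoff is estimated as (#decoys) / (#targets) among all
    matches scoring at or above that cutoff. Returns None if no cutoff works.
    """
    ranked = sorted(matches, key=lambda m: m[0], reverse=True)
    num_target = 0
    num_decoy = 0
    best_cutoff = None
    for i, (score, is_decoy) in enumerate(ranked):
        if is_decoy:
            num_decoy += 1
        else:
            num_target += 1
        # Only evaluate once every match sharing this score has been counted.
        last_of_tie = (i + 1 == len(ranked)) or (ranked[i + 1][0] != score)
        if last_of_tie and num_target > 0:
            if float(num_decoy) / num_target <= max_fdr:
                best_cutoff = score
    return best_cutoff
```

Matches scoring at or above the returned cutoff would be kept; the same cutoff could then be applied to de novo results, as the update note describes.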

We have decided to keep using the low-level functions of TensorFlow to construct the neural networks. We think they give a better understanding of the basic details of our model and of how to improve it. The network architecture is not very complicated, so the code stays readable even with low-level functions. We will eventually switch to high-level ones such as tf.layers.

We have added the database search function to DeepNovo. Both modules, de novo sequencing and database search, are now available.

The pre-trained model, training and testing data can be downloaded from here:

https://drive.google.com/drive/folders/1qB8wDBnnm1qw0wDuSCxOoxkyV-b4LkTo?usp=sharing

The following updates are also included in this version:

The implementation has been upgraded and tested on TensorFlow 1.2.
The code has been cleaned up with PEP8 and TensorFlow pylint guides, but many docstrings are still to be added.
Functional modules including I/O, training, de novo sequencing, database search, and testing should be grouped into separate worker classes. The same applies to the neural network models.

How to use DeepNovo?

DeepNovo is implemented and tested with Python 2.7, TensorFlow 1.2 and Cython.

Step 0: Build deepnovo_cython_setup to accelerate Python with C.

python deepnovo_cython_setup.py build_ext --inplace

Step 1: Test a pre-trained model with DeepNovo de novo sequencing

python deepnovo_main.py --train_dir train.example --decode --beam_search --beam_size 5

The testing mgf file is defined in "deepnovo_config.py", for example:

decode_test_file = "data.training/yeast.low.coon_2013/peaks.db.mgf.test.dup"

Step 2: Test a pre-trained model with DeepNovo database search

python deepnovo_main.py --train_dir train.example --search_db

The testing mgf file is defined in "deepnovo_config.py", for example:

input_file = "data.training/yeast.low.coon_2013/peaks.db.mgf.test.dup"

The results are written to the model folder "train.example".

Step 3: Train a DeepNovo model using the following command.

python deepnovo_main.py --train_dir train.example --train

The training mgf files are defined in "deepnovo_config.py", for example:

input_file_train = "data.training/yeast.low.coon_2013/peaks.db.mgf.train.dup"

input_file_valid = "data.training/yeast.low.coon_2013/peaks.db.mgf.valid.dup"

input_file_test = "data.training/yeast.low.coon_2013/peaks.db.mgf.test.dup"

The model files will be written to the training folder "train.example".

Step 4: De novo sequencing.

Currently DeepNovo supports training and testing modes only. Hence, the real peptides need to be provided in the input mgf files with the tag "SEQ=". If you want to do de novo sequencing on spectra without known peptides, you can provide any arbitrary sequence for the tag "SEQ=" to get past the reading procedure. In the output file, you can ignore the accuracy calculation and simply use the predicted peptide sequence.
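
The workaround above can be scripted. Below is a minimal sketch that inserts a placeholder "SEQ=" line into every spectrum that lacks one. The function name `add_placeholder_seq`, the placeholder peptide, and the choice to put "SEQ=" as the last header line are assumptions for illustration, not part of DeepNovo.

```python
def add_placeholder_seq(lines, placeholder="AAAAAAAA"):
    """Insert 'SEQ=<placeholder>' into each MGF spectrum that has no SEQ line.

    lines: the lines of an MGF file. Returns a new list of stripped lines.
    The placeholder is arbitrary; it only lets the reader accept the spectrum.
    """
    out = []
    in_header = False
    has_seq = False
    for line in lines:
        stripped = line.strip()
        if stripped == "BEGIN IONS":
            in_header = True
            has_seq = False
        elif in_header:
            if stripped.startswith("SEQ="):
                has_seq = True
            elif stripped and "=" not in stripped:
                # First peak line (or END IONS): the header block is over.
                if not has_seq:
                    out.append("SEQ=" + placeholder)
                in_header = False
        out.append(stripped)
    return out
```

For a file on disk, one could read it with `open(path).read().splitlines()`, pass the list through the function, and write the result back joined by newlines. Any accuracy reported against the placeholder is meaningless and can be ignored.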

All other options can be found in "deepnovo_config.py".

deepnovo's Issues

AssertionError: Error: wrong input PEPMASS

Hi Hieu,

I have mzML files that were converted to MGF using ProteoWizard. I get this error from DeepNovo:

'''
Traceback (most recent call last):
File "/opt/DeepNovo/deepnovo_main.py", line 83, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/opt/DeepNovo/deepnovo_main.py", line 61, in main
predicted_denovo_list = worker_denovo.search_denovo(model, worker_io)
File "/opt/DeepNovo/deepnovo_worker_denovo.py", line 109, in search_denovo
spectrum_batch = worker_io.get_spectrum(location_batch)
File "/opt/DeepNovo/deepnovo_worker_io.py", line 93, in get_spectrum
intensity_list) = self._parse_spectrum(location)
File "/opt/DeepNovo/deepnovo_worker_io.py", line 208, in _parse_spectrum
precursor_mz, charge, scan, raw_sequence = self._parse_spectrum_header()
File "/opt/DeepNovo/deepnovo_worker_io.py", line 225, in _parse_spectrum_header
assert "PEPMASS=" in line, "Error: wrong input PEPMASS"

'''
It looks like an incompatibility between the MGF format you are using and the one produced by ProteoWizard. Compare the first lines of both files:

Your example file (peaks.db.mgf.test.dup)

'''
BEGIN IONS
TITLE=C:\Users\nh2tran\WORKING\DeepNovo\DeepDB\PEAKS\yeast.low.coon_2013\run_ALL_C:\Users\nh2tran\WORKING\DeepNovo\DeepDB\data\yeast.low.coon_2013\singleShot_Fusion-1493228696289\10sep2013_yeast_control_1.raw_SCANS_1393
PEPMASS=304.494
CHARGE=3+
SCANS=F1:1393
RTINSECONDS=131.85599
SEQ=QIVHDSGR
120.147 14.099166
120.784 43.689667
127.063 20.362473

'''

Obtained from ProteoWizard

'''
BEGIN IONS
TITLE=KristofferS_H1507_145.2.2.1 File:"KristofferS_H1507_145.raw", NativeID:"controllerType=0 controllerNumber=1 scan=2"
RTINSECONDS=0.6773028
PEPMASS=391.285153404545 1214043.75
CHARGE=1+
149.0232544 524525.5138634257
150.0265961 19292.7731562133
167.0337524 28203.4820123287
172.3424682 858.6995042906
183.1525574 775.4042822765
'''

I already tried these without any success:

Delete the second column in PEPMASS
Reduce the number of decimal for the value in PEPMASS

Questions:
a. Do you think this is related to the order of the header labels?
b. Do you recommend a particular converter?

best

Carlos
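
On question (a): the two excerpts above list the header fields in a different order, and the ProteoWizard file has no "SCANS=" or "SEQ=" line. One plausible reading, not confirmed anywhere on this page, is that the parser expects the header lines in the fixed order of the example file. The sketch below rewrites each spectrum header into that order. The function name `normalize_header`, the placeholder values for missing fields, and dropping the intensity column of PEPMASS are all assumptions.

```python
# Field order taken from the example file peaks.db.mgf.test.dup shown above.
HEADER_ORDER = ["TITLE", "PEPMASS", "CHARGE", "SCANS", "RTINSECONDS", "SEQ"]


def normalize_header(lines, placeholder_seq="AAAAAAAA"):
    """Rewrite each MGF spectrum so its header follows HEADER_ORDER.

    Missing SCANS is filled with the spectrum's 1-based index and missing SEQ
    with placeholder_seq (both hypothetical defaults). Header fields that are
    not in HEADER_ORDER, and lines outside BEGIN/END IONS, are dropped.
    """
    out = []
    fields = None  # header of the spectrum being read; None outside a spectrum
    peaks = []
    count = 0
    for line in lines:
        stripped = line.strip()
        if stripped == "BEGIN IONS":
            fields, peaks = {}, []
            count += 1
        elif stripped == "END IONS" and fields is not None:
            fields.setdefault("SCANS", str(count))
            fields.setdefault("SEQ", placeholder_seq)
            out.append("BEGIN IONS")
            for key in HEADER_ORDER:
                if key in fields:
                    out.append(key + "=" + fields[key])
            out.extend(peaks)
            out.append("END IONS")
            fields = None
        elif fields is not None and stripped:
            if "=" in stripped and not stripped[0].isdigit():
                key, value = stripped.split("=", 1)
                if key == "PEPMASS":
                    value = value.split()[0]  # keep m/z, drop optional intensity
                fields[key] = value
            else:
                peaks.append(stripped)
    return out
```

Whether this resolves the assertion depends on how deepnovo_worker_io.py actually reads the header, so the output should be checked against the example file before a full run.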

Cython compatible version

Hi,

Since you are making changes to the TensorFlow code, could you please specify which version of Cython I should use with each of the two TensorFlow versions reported for DeepNovo?
Do you have a container image of DeepNovo? I am trying to run it and keep getting errors (I already emailed the details to [email protected]).

thanks in advance

filters in code not consistent with it in paper

I'm interested in your nice work, but one point confuses me. In the spectrum-CNN part of the SI Text of your paper, the filter size (2*10) does not seem consistent with the one in your code (# conv1: filter [1, 4, 1, 4] with stride [1, 1, 1, 1]).

ImportError: cannot import name rnn_cell_impl

Hi @nh2tran,

I'm trying to run DeepNovo on my sample, and I got the following error.
I get the same error when I try Step 1.

python deepnovo_main.py --train_dir train.example --train
vocab_reverse ['_PAD', '_GO', '_EOS', 'A', 'R', 'N', 'Nmod', 'D', 'Cmod', 'E', 'Q', 'Qmod', 'G', 'H', 'I', 'L', 'K', 'M', 'Mmod', 'F', 'P', 'S', 'T', 'W', 'Y', 'V']
vocab {'_GO': 1, '_EOS': 2, '_PAD': 0, 'Mmod': 18, 'A': 3, 'E': 9, 'D': 7, 'G': 12, 'F': 19, 'I': 14, 'H': 13, 'K': 16, 'M': 17, 'L': 15, 'Nmod': 6, 'N': 5, 'Q': 10, 'P': 20, 'S': 21, 'R': 4, 'T': 22, 'W': 23, 'V': 25, 'Y': 24, 'Cmod': 8, 'Qmod': 11}
vocab_size 26
SPECTRUM_RESOLUTION 10
WINDOW_SIZE 10
MAX_LEN 30
_buckets [12, 22, 32]
num_ion 8
l2_loss_weight 0.0
embedding_size 512
num_layers 1
num_units 512
keep_conv 0.75
keep_dense 0.5
batch_size 128
epoch_stop 20
train_stack_size 4500
valid_stack_size 15000
test_stack_size 4000
buffer_size 4000
steps_per_checkpoint 100
random_test_batches 10
max_gradient_norm 5.0
Traceback (most recent call last):
File "deepnovo_main.py", line 15, in <module>
import deepnovo_model
File "/Data2/HJE/DeepNovo/deepnovo_model.py", line 43, in <module>
from tensorflow.python.ops import rnn_cell_impl
ImportError: cannot import name rnn_cell_impl

Running DeepNovo on Windows with Python 3

I would like to try DeepNovo but have been unable to get it running on my machine so far.
It runs Windows, and TensorFlow doesn't seem to be available for Python 2.7 on that system.
So I have been trying to modify DeepNovo so that it works with Windows and Python 3.7 (check out my fork https://github.com/StSchulze/DeepNovo.git).
However, when I try to run the "Test a pre-trained model with DeepNovo de novo sequencing" example, I get the following error:

vocab_reverse  ['_PAD', '_GO', '_EOS', 'A', 'R', 'N', 'Nmod', 'D', 'Cmod', 'E', 'Q', 'Qmod', 'G', 'H', 'I', 'L', 'K', 'M', 'Mmod', 'F', 'P', 'S', 'T', 'W', 'Y', 'V']
vocab  {'_PAD': 0, '_GO': 1, '_EOS': 2, 'A': 3, 'R': 4, 'N': 5, 'Nmod': 6, 'D': 7, 'Cmod': 8, 'E': 9, 'Q': 10, 'Qmod': 11, 'G': 12, 'H': 13, 'I': 14, 'L': 15, 'K': 16, 'M': 17, 'Mmod': 18, 'F': 19, 'P': 20, 'S': 21, 'T': 22, 'W': 23, 'Y': 24, 'V': 25}
vocab_size  26
SPECTRUM_RESOLUTION  10
WINDOW_SIZE  10
MAX_LEN  50
_buckets  [12, 22, 32]
num_ion  8
l2_loss_weight  0.0
embedding_size  512
num_layers  1
num_units  512
keep_conv  0.75
keep_dense  0.5
batch_size  128
epoch_stop  20
train_stack_size  4500
valid_stack_size  15000
test_stack_size  4000
buffer_size  4000
steps_per_checkpoint  100
random_test_batches  10
max_gradient_norm  5.0
main()
2019-05-08 10:53:46.365188: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
DECODING MODEL
================================================================================
ModelInference: __init__()
================================================================================
ModelNetwork: __init__()
================================================================================
ModelInference: build_model()
================================================================================
ModelNetwork: build_network()
================================================================================
ModelNetwork: _build_cnn_spectrum()
WARNING:tensorflow:From C:\Users\Admin\Desktop\ursgal_dev\DeepNovo_StSchulze\DeepNovo\deepnovo_model.py:474: UniformUnitScaling.__init__ (from tensorflow.python.ops.init_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.initializers.variance_scaling instead with distribution=uniform to get equivalent behavior.
WARNING:tensorflow:From C:\Program Files\Python3.7\lib\site-packages\tensorflow\python\framework\op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From C:\Users\Admin\Desktop\ursgal_dev\DeepNovo_StSchulze\DeepNovo\deepnovo_model.py:504: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
================================================================================
ModelNetwork: _build_embedding_AAid()
================================================================================
ModelNetwork: _build_cnn_ion()
================================================================================
ModelNetwork: _build_lstm()
WARNING:tensorflow:From C:\Users\Admin\Desktop\ursgal_dev\DeepNovo_StSchulze\DeepNovo\deepnovo_model.py:573: BasicLSTMCell.__init__ (from tensorflow.python.ops.rnn_cell_impl) is deprecated and will be removed in a future version.
Instructions for updating:
This class is equivalent as tf.keras.layers.LSTMCell, and will be replaced by that in Tensorflow 2.0.
================================================================================
ModelNetwork: _build_cnn_ion()
================================================================================
ModelNetwork: _build_lstm()
================================================================================
ModelInference: restore_model()
Error: model not found.

The warnings above are just TensorFlow deprecation warnings (I'm not familiar enough with TensorFlow to resolve them, but they shouldn't be a major issue). However, I don't know how to get around the "model not found" error. Any ideas on this?

Thanks for any help!

AttributeError: 'module' object has no attribute '_linear'

I was trying to run DeepNovo but encountered the error below. By the way, can I run DeepNovo-DIA with DDA MS data?

python ../deepnovo_main.py --train_dir train.example --decode --beam_search --beam_size 5
vocab_reverse  ['_PAD', '_GO', '_EOS', 'A', 'R', 'N', 'Nmod', 'D', 'Cmod', 'E', 'Q', 'Qmod', 'G', 'H', 'I', 'L', 'K', 'M', 'Mmod', 'F', 'P', 'S', 'T', 'W', 'Y', 'V']
vocab  {'_GO': 1, '_EOS': 2, '_PAD': 0, 'Mmod': 18, 'A': 3, 'E': 9, 'D': 7, 'G': 12, 'F': 19, 'I': 14, 'H': 13, 'K': 16, 'M': 17, 'L': 15, 'Nmod': 6, 'N': 5, 'Q': 10, 'P': 20, 'S': 21, 'R': 4, 'T': 22, 'W': 23, 'V': 25, 'Y': 24, 'Cmod': 8, 'Qmod': 11}
vocab_size  26
SPECTRUM_RESOLUTION  10
WINDOW_SIZE  10
MAX_LEN  50
_buckets  [12, 22, 32]
num_ion  8
l2_loss_weight  0.0
embedding_size  512
num_layers  1
num_units  512
keep_conv  0.75
keep_dense  0.5
batch_size  128
epoch_stop  20
train_stack_size  4500
valid_stack_size  15000
test_stack_size  4000
buffer_size  4000
steps_per_checkpoint  100
random_test_batches  10
max_gradient_norm  5.0
main()
2019-04-03 21:05:18.040118: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
DECODING MODEL
================================================================================
ModelInference: __init__()
================================================================================
ModelNetwork: __init__()
================================================================================
ModelInference: build_model()
================================================================================
ModelNetwork: build_network()
================================================================================
ModelNetwork: _build_cnn_spectrum()
WARNING:tensorflow:From /public/home/nong/bin/DeepNovo/deepnovo_model.py:473: __init__ (from tensorflow.python.ops.init_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.initializers.variance_scaling instead with distribution=uniform to get equivalent behavior.
================================================================================
ModelNetwork: _build_embedding_AAid()
================================================================================
ModelNetwork: _build_cnn_ion()
Traceback (most recent call last):
  File "../deepnovo_main.py", line 83, in <module>
    tf.app.run()
  File "/public/home/nong/miniconda3/envs/deepnovo/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "../deepnovo_main.py", line 35, in main
    deepnovo_main_modules.decode()
  File "/public/home/nong/bin/DeepNovo/deepnovo_main_modules.py", line 1982, in decode
    model.build_model()
  File "/public/home/nong/bin/DeepNovo/deepnovo_model.py", line 707, in build_model
    self.dropout_keep)
  File "/public/home/nong/bin/DeepNovo/deepnovo_model.py", line 310, in build_network
    direction)
  File "/public/home/nong/bin/DeepNovo/deepnovo_model.py", line 440, in _build_cnn_ion
    cnn_ion_logit = rnn_cell_impl._linear(args=cnn_ion_feature,
AttributeError: 'module' object has no attribute '_linear'

Missing files in data.training?

Hi @nh2tran ,

Thank you for your development!

I'm trying to run DeepNovo on my machine. I cloned the repository, downloaded the required files, and downgraded TF to 1.2.

After the Cython compilation, I ran the command below and got an error:
python deepnovo_main.py --train_dir train.example --decode --beam_search --beam_size 5

[Errno 2] No such file or directory: 'data.training/dia.xchen.nov27/fraction_1.mgf.split.test.dup'

In your Google Drive, dia.xchen.nov27/fraction_1.mgf.split.test.dup does not exist.
This looks like DIA data. If so, which lines should I comment out in the config file to run the test?

Please advise.

Best,
Yoshinori

ImportError: No module named Bio

Are there more dependencies? I get the following error when I try the code in the readme file.

> python deepnovo_main.py --train_dir train.example --decode --beam_search --beam_size 5
from Bio import SeqIO
ImportError: No module named Bio
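
The `Bio` package in the traceback above is provided by Biopython (installable with `pip install biopython`). A small sketch for checking up front whether the modules named on this page can be imported follows. The helper name `check_modules` is hypothetical, and the list `tensorflow`, `Cython`, `Bio` may not cover every dependency.

```python
import importlib


def check_modules(names):
    """Map each module name to True if it can be imported, else False."""
    status = {}
    for name in names:
        try:
            importlib.import_module(name)
            status[name] = True
        except ImportError:
            status[name] = False
    return status
```

Calling `check_modules(["tensorflow", "Cython", "Bio"])` returns a dict such as `{"Bio": False, ...}` when Biopython is missing. It only tests importability, not whether the installed versions match what DeepNovo was tested with.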

Show Single Amino Acid Score in resultfile?

Hello,

Is there an easy option to show the single amino acid scores for each de novo prediction? Currently (Update version 0.0.1), the result file only shows the peptide score for each prediction. I would like to have information on the prediction strength of each amino acid (as in DeepNovo-PNAS).

Thank you in advance.

file missing

Dear @nh2tran ,

Thank you for your open-source project. I'm trying to run the code, but the following command fails with the error below:

python deepnovo_main.py --train_dir train.example --decode --beam_search --beam_size 5

[Errno 2] No such file or directory: 'data.training/dia.xchen.nov27/fraction_1.mgf.split.test.dup'

It seems this file does not exist in your Google Drive. How should I fix it?

Thank you again.

Best regards,
Guangchao

DeepNovo Installation Issue

Greetings,

I am trying to install DeepNovo and consistently hit the error message indicated below. Please advise me on how to troubleshoot. Thank you.

Regards,
Ben

packaging for brew manager?

Awesome tool-- just wanted to see if the authors had any intention of making DeepNovo available via the brew package manager... it simplifies things for a specific type of user...

At any rate, I can't wait to pull the source...

Thanks,
J