alexarnimueller / modlamp

Python package for peptide sequence generation, peptide descriptor calculation and sequence analysis.

Home Page: https://modlamp.org

License: Other

Makefile 0.05% Python 99.95%

modlamp's Introduction

README

modlAMP

modlAMP is a Python package for working with peptides, proteins or any sequence of natural amino acids. It comprises several modules, for example for descriptor calculation (module descriptors) or sequence generation (module sequences). For basic instructions on how to use the package, see the Usage section of this README or the documentation.

Installation

Quick note: modlAMP has supported Python 3 since version 4. Use with Python 2.7 is deprecated.

For the installation to work properly, pip needs to be installed. If you're not sure whether you already have pip, type pip --version in your terminal. If you don't have pip installed, install it via sudo easy_install pip.

There is no need to download the package manually to install modlAMP. In your terminal, just type the following command:

pip install modlamp

To update modlamp to the latest version, run the following:

pip install --upgrade modlamp

Usage

This section gives a quick overview of different capabilities of modlAMP. For a detailed description of all modules see the module documentation.

Importing modules

After installation, you should be able to import and use the different modules as shown below. Type python or ipython in your terminal to begin, then enter the following import statements:

>>> from modlamp.sequences import Helices
>>> from modlamp.descriptors import PeptideDescriptor
>>> from modlamp.database import query_database

Generating Sequences

The following example shows how to generate a library of 1000 sequences drawn from all available sequence generation methods:

>>> from modlamp.sequences import MixedLibrary
>>> lib = MixedLibrary(1000)
>>> lib.generate_sequences()
>>> lib.sequences[:10]
['VIVRVLKIAA','VGAKALRGIGPVVK','QTGKAKIKLVKLRAGPYANGKLF','RLIKGALKLVRIVGPGLRVIVRGAR','DGQTNRFCGI','ILRVGKLAAKV',...]

These commands generated a mixed peptide library comprising 1000 sequences.

Note

If duplicates are present in the attribute sequences, they are removed during generation. It is therefore possible to obtain fewer sequences than specified.
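The effect described in this note can be illustrated without modlAMP. The following is a minimal sketch in plain Python, not the package's internal implementation; the helper name drop_duplicates and the toy sequences are assumptions made for illustration only.

```python
def drop_duplicates(sequences):
    """Return the sequences with duplicates removed, keeping first occurrences."""
    # dict.fromkeys preserves insertion order (Python 3.7+)
    return list(dict.fromkeys(sequences))


# Toy input with one duplicated sequence (hypothetical data)
requested = ['KLLKLLK', 'GLFDIVKK', 'KLLKLLK', 'AKCLVDKK']
unique = drop_duplicates(requested)
print(len(requested), len(unique))  # the second number is smaller when duplicates exist
```

If an exact library size matters, comparing len(lib.sequences) with the requested number after generation shows whether duplicates were dropped.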

The module sequences incorporates different sequence generation classes (random, helices etc.). For their documentation, consult the docs for the module modlamp.sequences.

Calculating Descriptor Values

Different descriptor values can now be calculated for the sequences generated in the Generating Sequences section above.

How to calculate the Eisenberg hydrophobic moment for given sequences:

>>> from modlamp.descriptors import PeptideDescriptor, GlobalDescriptor
>>> desc = PeptideDescriptor(lib.sequences,'eisenberg')
>>> desc.calculate_moment()
>>> desc.descriptor[:10]
array([[ 0.60138255],[ 0.61232763],[ 0.01474009],[ 0.72333858],[ 0.20390763],[ 0.68818279],...])
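For readers unfamiliar with the hydrophobic moment, the idea can be sketched in a few lines: each residue contributes a vector whose length is its hydrophobicity and whose direction rotates by a fixed angle per residue (100 degrees for an ideal alpha-helix), and the moment is the length of the vector sum. The sketch below is an illustration under stated assumptions, not modlAMP's calculate_moment: the function name, the division by sequence length, and the two-letter toy scale are all hypothetical, and the package may use a different windowing and normalization.

```python
import math


def hydrophobic_moment(sequence, scale, angle=100.0):
    """Length-normalized hydrophobic moment of one sequence.

    scale: dict mapping each residue letter to a hydrophobicity value.
    angle: rotation per residue in degrees (100 for an ideal alpha-helix).
    """
    sin_sum = 0.0
    cos_sum = 0.0
    for i, residue in enumerate(sequence):
        rad = math.radians(i * angle)
        sin_sum += scale[residue] * math.sin(rad)
        cos_sum += scale[residue] * math.cos(rad)
    # Length of the summed vector, divided by the number of residues (an assumption)
    return math.hypot(sin_sum, cos_sum) / len(sequence)


# Hypothetical two-letter toy scale, NOT the Eisenberg values.
toy_scale = {'L': 1.0, 'K': -1.0}

# 18 identical residues span exactly 5 helical turns, so the vectors cancel.
uniform = hydrophobic_moment('L' * 18, toy_scale)
# Alternating hydrophobic/charged pattern along the helix.
amphipathic = hydrophobic_moment('LKKLLKLLKKLLKLLKKL', toy_scale)
print(uniform, amphipathic)
```

A sequence whose hydrophobic residues cluster on one face of the helix gives a large moment, whereas a uniform sequence gives a moment close to zero.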

Global descriptor features like charge, hydrophobicity or isoelectric point can be calculated as well:

>>> glob = GlobalDescriptor(lib.sequences)
>>> glob.isoelectric_point()
>>> glob.descriptor[:10]
array([ 10.09735107,   8.75006104,  12.30743408,  11.26385498, ...])

Auto- and cross-correlation type functions with different window sizes can be applied to all available amino acid scales. Here is an example for the pepCATS scale:

>>> pepCATS = PeptideDescriptor('sequence/file/to/be/loaded.fasta', 'pepcats')
>>> pepCATS.calculate_crosscorr(7)
>>> pepCATS.descriptor
array([[ 0.6875    ,  0.46666667,  0.42857143,  0.61538462,  0.58333333,

Many more amino acid scales are available for descriptor calculation. The complete list can be found in the documentation for the modlamp.descriptors module.

Plotting Features

We can also plot the calculated values as a boxplot, for example the hydrophobic moment:

>>> from modlamp.plot import plot_feature
>>> D = PeptideDescriptor('sequence/file/to/be/loaded.fasta', 'eisenberg')  # Eisenberg hydrophobicity scale
>>> D.calculate_moment()
>>> plot_feature(D.descriptor,y_label='uH Eisenberg')

We can additionally compare these descriptor values to known AMP sequences. For that, we import sequences from the APD3, which are stored in the FASTA formatted file APD3.fasta:

>>> APD = PeptideDescriptor('/Path/to/file/APD3.fasta', 'eisenberg')
>>> APD.calculate_moment()

Now let's compare the values by plotting:

>>> plot_feature([D.descriptor, APD.descriptor], y_label='uH Eisenberg', x_tick_labels=['Library', 'APD3'])

It is also possible to plot 2 or 3 different features in a scatter plot:

Example:	2D Scatter Plot

>>> from modlamp.plot import plot_2_features
>>> A = PeptideDescriptor('/Path/to/file/class1&2.fasta', 'eisenberg')
>>> A.calculate_moment()
>>> B = GlobalDescriptor('/Path/to/file/class1&2.fasta')
>>> B.isoelectric_point()
>>> target = [1] * (len(A.sequences) // 2) + [0] * (len(A.sequences) // 2)  # integer division, required in Python 3
>>> plot_2_features(A.descriptor, B.descriptor, x_label='uH', y_label='pI', targets=target)

Example:	3D Scatter Plot

>>> from modlamp.plot import plot_3_features
>>> B = GlobalDescriptor(APD.sequences)
>>> B.isoelectric_point()
>>> B.length(append=True)  # append to the previously calculated descriptor values
>>> plot_3_features(APD.descriptor, B.descriptor[:, 0], B.descriptor[:, 1], x_label='uH', y_label='pI', z_label='len')

Example:	Helical Wheel Plot

>>> from modlamp.plot import helical_wheel
>>> helical_wheel('GLFDIVKKVVGALGSL', moment=True)

Further plotting methods are available. See the documentation for the modlamp.plot module.

Database Connection

Peptides from the two most prominent AMP databases APD and CAMP can be directly scraped with the modlamp.database module.

For downloading a set of sequences from the APD database, first get the IDs of the sequences you want to query from the APD website. Then proceed as follows:

>>> from modlamp.database import query_apd, query_camp
>>> query_apd([15, 16, 17, 18, 19, 20])  # download sequences with APD IDs 15 to 20
['GLFDIVKKVVGALGSL','GLFDIVKKVVGAIGSL','GLFDIVKKVVGTLAGL','GLFDIVKKVVGAFGSL','GLFDIAKKVIGVIGSL','GLFDIVKKIAGHIAGSI']

The same holds true for the CAMP database:

>>> query_camp([2705, 2706])  # download sequences with CAMP IDs 2705 & 2706
['GLFDIVKKVVGALGSL','GLFDIVKKVVGTLAGL']

modlAMP also hosts a module for connecting to your own database on a private server. Peptide sequences included in any table in the database can be downloaded.

Note

The modlamp.database.query_database function allows connection and queries to a personal database. For successful connection, the database configuration needs to be specified in the db_config.json file, which is located in modlamp/data/ by default.

Sequences (stored in a column named sequence) from the personal database can then be queried as follows:

>>> from modlamp.database import query_database
>>> query_database('my_experiments', ['sequence'], configfile='./modlamp/data/db_config.json')
Password: >? ***********
Connecting to MySQL database...
connection established!
['ILDSSWQRTFLLS','IKLLHIF','ACFDDGLFRIIKFLLASDRFFT', ...]

Loading Prepared Datasets

For AMP QSAR models, there are different options for choosing negative / inactive peptide examples. We assembled several datasets for classification tasks that can be read by the modlamp.datasets module. The available datasets are listed in the documentation of the modlamp.datasets module.

Example:	AMPs vs. transmembrane regions of proteins

>>> from modlamp.datasets import load_AMPvsTM
>>> data = load_AMPvsTM()
>>> data.keys()
['target_names', 'target', 'feature_names', 'sequences']

The variable data holds four keys, which can also be accessed as attributes. The available attributes for load_AMPvsTM() are target_names (target names), target (the target class vector), feature_names (the name of the data features, here: 'Sequence') and sequences (the loaded sequences).

Example:

>>> data.target_names  # class names
array(['TM', 'AMP'], dtype='|S3')
>>> data.sequences[:5]  # sequences
[array(['AAGAATVLLVIVLLAGSYLAVLA', 'LWIVIACLACVGSAAALTLRA', 'FYRFYMLREGTAVPAVWFSIELIFGLFA', 'GTLELGVDYGRAN',
       'KLFWRAVVAEFLATTLFVFISIGSALGFK'],  dtype='|S100')
>>> data.target  # corresponding target classes
array([0, 0, 0, 0, 0 .... 1, 1, 1, 1])

Analysing Wetlab Circular Dichroism Data

The module modlamp.wetlab includes the class modlamp.wetlab.CD to analyse raw circular dichroism data from wetlab experiments. The following example shows how to load raw data files and calculate secondary structure contents:

>>> from modlamp.wetlab import CD
>>> cd = CD('/path/to/your/folder', 185, 260)  # load all files in a specified folder
>>> cd.names  # peptide names read from the file headers
['Pep 10', 'Pep 10', 'Pep 11', 'Pep 11', ... ]
>>> cd.calc_meanres_ellipticity()  # calculate the mean residue ellipticity values
>>> cd.meanres_ellipticity
array([[   260.        ,   -266.95804196],
       [   259.        ,   -338.13286713],
       [   258.        ,   -387.25174825], ...])
>>> cd.helicity(temperature=24., k=3.492185008, induction=True)  # calculate helical content
>>> cd.helicity_values
            Name     Solvent  Helicity  Induction
            Peptide1     T    100.0     3.823
            Peptide1     W    26.16     0.000
            Peptide2     T    76.38     3.048
            Peptide2     W    25.06     0.000 ...

The read and calculated values can finally be plotted as follows:

>>> cd.plot(data='mean residue ellipticity', combine=True)

Analysis of Different Sequence Libraries

The module modlamp.analysis includes the class modlamp.analysis.GlobalAnalysis to compare different sequence libraries. Learn how to use it with the following example:

>>> lib  # sequence library with 3 sub-libraries
array([['ARVFVRAVRIYIRVLKAFAKL', 'IRVYVRIVRGFGRVVRAYARV', 'IRIFIRIARGFGRAIRVFVRI', ..., 'RGPCFLQVVD'],
       ['EYKIGGKA', 'RAVKGGGRLLAG', 'KLLRIILRGARIIIRGLR', ..., 'AKCLVDKK', 'VGGAFALVSV'],
       ['GVHLKFKPAVSRKGVKGIT', 'RILRIGARVGKVLIK', 'MKGIIGHTWKLKPTIPSGKSAKC', ..., 'GRIIRLAIKAGL']], dtype='|S28')
>>> lib.shape
(3, 2000)
>>> from modlamp.analysis import GlobalAnalysis
>>> analysis = GlobalAnalysis(lib, names=['Lib 1', 'Lib 2', 'Lib 3'])
>>> analysis.plot_summary()

Documentation

Detailed documentation of all modules is available on the modlAMP documentation website.

Citing modlAMP

If you are using modlAMP for a scientific publication, please cite the following paper:

Müller A. T. et al. (2017) modlAMP: Python for antimicrobial peptides, Bioinformatics 33, (17), 2753-2755, DOI:10.1093/bioinformatics/btx285.

modlamp's Issues

plot.py plot_aa_distr bug

Hi,

There is a bug within plot.py, function plot_aa_distr. It should be as it used to be:

for a in range(20):
  plt.bar(a, list(aa.values())[a], 0.9, color=color)

Rather than (680-682):

for i, v in enumerate([k for k, w in aa.items()]):
  plt.bar(i, v, 0.9, color=color)

Vito
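The difference between the two loops in this report can be seen on a small dictionary without any plotting. This is a minimal sketch that assumes aa maps amino acid letters to frequencies, as the report implies; the toy values are made up.

```python
# Toy stand-in for the `aa` dictionary (letter -> relative frequency); values are made up.
aa = {'A': 0.5, 'K': 0.3, 'L': 0.2}

# Loop from the "rather than" version in the report: it iterates over the KEYS.
heights_from_keys = [v for i, v in enumerate([k for k, w in aa.items()])]

# Loop from the version the report asks for: it indexes into the VALUES.
heights_from_values = [list(aa.values())[a] for a in range(len(aa))]

print(heights_from_keys)    # letters, which are not bar heights
print(heights_from_values)  # frequencies
```

The first list contains the amino acid letters, so the bars would not reflect the frequencies; the second contains the numbers that the plot is meant to show.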

Issue in obtaining descriptors for FASTA data

Hello @alexarnimueller

I am having an issue in obtaining the descriptor data for the FASTA data here- http://caps.ncbs.res.in/3dswap-pred/data/3dswap-pred_positive_dataset.fasta

Here is the program I am running-

from modlamp.descriptors import PeptideDescriptor
pepdesc = PeptideDescriptor('3dswap-pred_negative_dataset.fasta', 'eisenberg') 
pepdesc.calculate_global()
pepdesc.calculate_moment(append=True)  
pepdesc.load_scale('z3')
pepdesc.calculate_autocorr(1, append=True)
col_names = 'ID,Sequence,H_Eisenberg,uH_Eisenberg,Z3_1,Z3_2,Z3_3'
pepdesc.save_descriptor('neg_descriptors1.csv', header=col_names)

I am obtaining this error-

Traceback (most recent call last):
  File "desc_negative.py", line 8, in <module>
    pepdesc.calculate_global()  # calculate global Eisenberg hydrophobicity
  File "/usr/local/lib/python2.7/dist-packages/modlamp/descriptors.py", line 802, in calculate_global
    mtrx.append(self.scale[str(seq[l])])
KeyError: 'X'
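The traceback shows a lookup in self.scale failing for the letter 'X', which suggests that at least one sequence in the FASTA file contains a residue outside the 20 natural amino acids. One possible workaround, sketched here in plain Python, is to separate such sequences before descriptor calculation. The helper name split_by_alphabet and the example sequences are hypothetical; whether to drop, trim or otherwise handle the rejected sequences is a decision for the user.

```python
NATURAL_AA = set('ACDEFGHIKLMNPQRSTVWY')  # the 20 natural amino acids, one-letter codes


def split_by_alphabet(sequences):
    """Separate sequences made only of natural residues from all others."""
    clean, rejected = [], []
    for seq in sequences:
        # A sequence is clean if every letter is one of the 20 natural residues
        (clean if set(seq.upper()) <= NATURAL_AA else rejected).append(seq)
    return clean, rejected


clean, rejected = split_by_alphabet(['GLFDIVKKVVGALGSL', 'MKXLLA', 'KLLKLLK'])
print(clean)     # sequences safe to look up in a 20-letter scale
print(rejected)  # sequences containing letters such as 'X'
```

Since the README passes a list of sequences to PeptideDescriptor (lib.sequences), the clean list could be handed over in place of the FASTA file name.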

Difficulty in Analysis of Different Sequence Libraries

Dear Sir
I am using the modlamp.analysis module to analyse a peptide sequence dataset. I am able to run g = GlobalAnalysis(['GLFDIVKKVVGALG', 'KLLKLLKKLLKLLK', ...], names=['Library1']) for the amino acid frequency calculation and the summary plot, but I am having difficulty passing the dataset as a dataframe.

I converted a csv file into a dataframe but was not able to run the analysis with the above commands; I am getting a syntax error. Can you please guide me on how to use a dataframe in the above script?
Furthermore, I want to ask whether Analysis of Different Sequence Libraries only takes input in the form of a list/array. If yes, how can I use my peptide dataframe for the analysis?
I have 3 peptide libraries, each in its own column.

Please help
Regards
Sandeep
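The examples in the README pass either a list of sequences or an array of sub-libraries to GlobalAnalysis, so one way to approach this question is to turn each column into a plain list first. The sketch below does this with the standard library csv module; the column names and CSV content are hypothetical, and it is an assumption (not confirmed here) that GlobalAnalysis handles libraries of unequal length.

```python
import csv
import io

# Hypothetical CSV content; in practice this would be open('peptides.csv', newline='').
csv_text = """lib1,lib2,lib3
GLFDIVKKVVGALG,EYKIGGKA,RILRIGARVGKVLIK
KLLKLLKKLLKLLK,AKCLVDKK,GRIIRLAIKAGL
"""

reader = csv.DictReader(io.StringIO(csv_text))
columns = {name: [] for name in reader.fieldnames}
for row in reader:
    for name, value in row.items():
        if value:  # skip empty cells in shorter columns
            columns[name].append(value)

names = list(columns)                          # one name per library
libraries = [columns[name] for name in names]  # one list of sequences per library
print(names)
print(libraries)
```

With pandas, the per-column equivalent would be df[name].dropna().tolist(). The resulting libraries and names could then be tried as GlobalAnalysis(libraries, names=names), following the call pattern in the README.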

GlobalAnalysis Plot

In the plot_summary of the GlobalAnalysis plot, the legend grows over the amide and pH info if more than 3 libraries are plotted.
If only one library is plotted, the bars in the AA distribution plot are shifted too far to the right.

pip internal functions issue while building bioconda package

I have successfully added modlamp==4.1.2 to bioconda. However, when I try to build a package for 4.1.4, the latest release, the conda build fails with the error shown below. I get a similar error when I try to install modlamp==4.1.4 with pip on my local computer, and the install eventually fails.

13:22:46 BIOCONDA INFO (OUT) Added file://$SRC_DIR to build tracker '/tmp/pip-req-tracker-u0iC5E'
13:22:46 BIOCONDA INFO (OUT) Running setup.py (path:$SRC_DIR/setup.py) egg_info for package from file://$SRC_DIR
13:22:46 BIOCONDA INFO (OUT) Running command python setup.py egg_info
13:22:46 BIOCONDA INFO (OUT) Created temporary directory: /tmp/pip-pip-egg-info-hJByB2
13:22:46 BIOCONDA INFO (OUT) Traceback (most recent call last):
13:22:46 BIOCONDA INFO (OUT) File "<string>", line 1, in <module>
13:22:46 BIOCONDA INFO (OUT) File "/opt/conda/conda-bld/modlamp_1588166316542/work/setup.py", line 11, in <module>
13:22:46 BIOCONDA INFO (OUT) reqs = [str(ir.req) for ir in install_reqs][:-1]
13:22:46 BIOCONDA INFO (OUT) AttributeError: 'ParsedRequirement' object has no attribute 'req'

Details can be found in bioconda/bioconda-recipes#21839.

After discussing this issue with the Bioconda community, it looks like setup.py uses some of pip's internal functions, which will break over time. Resolving this issue would help to build this package for conda.
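The traceback points at setup.py reading ir.req from objects returned by pip's internal requirement parser. A common way to avoid depending on pip internals is to read the requirements file directly. The following is a sketch of that approach, not the maintainers' actual fix; the helper name read_requirements and the file name requirements.txt are assumptions.

```python
import os
import tempfile


def read_requirements(path):
    """Return requirement strings from a pip-style requirements file."""
    requirements = []
    with open(path) as handle:
        for line in handle:
            line = line.split('#', 1)[0].strip()  # drop comments and whitespace
            if line and not line.startswith('-'):  # skip blanks and pip options such as -r / -e
                requirements.append(line)
    return requirements


# Demonstration with a temporary file standing in for requirements.txt
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'requirements.txt')
    with open(demo_path, 'w') as handle:
        handle.write('numpy>=1.10  # arrays\n\nscipy\n-r other.txt\n')
    reqs = read_requirements(demo_path)

print(reqs)
```

In a setup.py, the returned list could be passed to setuptools as install_requires=read_requirements('requirements.txt'). Note that this simple comment stripping would also truncate URL requirements that contain a '#' fragment.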

Issue in the modlAMP example script for peptide classification

@alexarnimueller There are a few issues in the example script for peptide classification. Like in Line 17, it shows a NameError for desc.

I have made some fixes to the script. Should I send a PR for the same?

Improve documentation for descriptors.GlobalDescriptor

https://modlamp.org/modlamp.html#modlamp.descriptors.GlobalDescriptor.calculate_all

The method descriptors.GlobalDescriptor.calculate_all returns an array of 10 elements (everything except molecular formula). However, the documented example shows an array of 9 elements; sequence charge is missing. This was quite baffling to me when I first used the method.

The documentation could be written as such for better clarity:
"Method combining all 10 global descriptors (except molecular formula)..."

Calculation of amino acid probabilities

How were the probabilities for amino acids in the class BaseSequence() calculated? Particularly, how was prob_ACPhel computed?
Thank You!

py3?

Very nice project! Unfortunately, the rest of my code base is very much py3 based - have you considered adding py3 support? Would you be interested in contributions?

For the pepCATS descriptor, are the bit patterns listed somewhere?

Hello,
In the code, is there a table giving the bit pattern for each of the 20 amino acids?
I looked in the code but couldn't find it.
Thanks a lot,
F.

GlobalAnalysis of different seq_database

Hello @alexarnimueller:
I used the GlobalAnalysis method you provided to compare different sequence libraries and got the following error:

Traceback (most recent call last):
  File "d:/reoccur/amp-gan/ampgan/evaluation/analyze_seq.py", line 44, in <module>
    analysis.plot_summary()
  File "E:\anaconda3\envs\ampgan_test\lib\site-packages\modlamp\analysis.py", line 304, in plot_summary
    color=colors[:num])
  File "E:\anaconda3\envs\ampgan_test\lib\site-packages\matplotlib\__init__.py", line 1414, in inner
    return func(ax, *map(sanitize_sequence, args), **kwargs)
  File "E:\anaconda3\envs\ampgan_test\lib\site-packages\matplotlib\axes\_axes.py", line 6597, in hist
    raise ValueError(f"The 'color' keyword argument must have one "
ValueError: The 'color' keyword argument must have one color per dataset, but 2000 datasets and 3 colors were provided

The shape of my 2D input array is (3, 5000).

tests

score_cv and test_amphiarc failing

Incorrect formula for Glutamine using GlobalDescriptor

Using the code
from modlamp.descriptors import GlobalDescriptor
desc = GlobalDescriptor(['Q'])
desc.formula(amide=False)
for v in desc.descriptor:
    print(v[0])

I get:
C4 H7 N1 O4

The correct formula for Glutamine is:
C5H10N2O3
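The expected formula can be checked by hand: a peptide's formula is the sum of its residue compositions plus one water molecule for the free N- and C-termini. The sketch below does this bookkeeping with a Counter. It is an independent illustration, not modlAMP's implementation; the function name is hypothetical and the residue table contains only the two entries needed for the example.

```python
from collections import Counter

# Elemental composition of amino acid RESIDUES (free amino acid minus H2O).
RESIDUE_FORMULA = {
    'G': Counter(C=2, H=3, N=1, O=1),
    'Q': Counter(C=5, H=8, N=2, O=2),
}
WATER = Counter(H=2, O=1)


def peptide_formula(sequence):
    """Molecular formula of a linear peptide with free N- and C-termini."""
    total = Counter(WATER)  # one water accounts for the two free termini
    for residue in sequence:
        total += RESIDUE_FORMULA[residue]
    return ''.join(f'{element}{total[element]}' for element in 'CHNO')


print(peptide_formula('Q'))   # C5H10N2O3
print(peptide_formula('GQ'))  # C7H13N3O4
```

For comparison, the reported output C4 H7 N1 O4 equals the formula of free aspartic acid (C4H7NO4), which may hint at a mix-up in the residue lookup, although that is only a guess.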

Running SVM on AMP vs UniProt shows an error.

@alexarnimueller I am getting an error on running the AMP classification using SVM.

Traceback (most recent call last):
  File "classify-amp.py", line 27, in <module>
    lib.generate_sequences()
  File "/home/ssouravsingh12/.local/lib/python2.7/site-packages/modlamp/sequences.py", line 536, in generate_sequences
    H.generate_sequences()
  File "/home/ssouravsingh12/.local/lib/python2.7/site-packages/modlamp/sequences.py", line 136, in generate_sequences
    seq = ['X'] * random.choice(range(self.lenmin, self.lenmax + 1))
  File "mtrand.pyx", line 1121, in mtrand.RandomState.choice (numpy/random/mtrand/mtrand.c:17200)
ValueError: a must be non-empty
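The failing line draws a length from range(self.lenmin, self.lenmax + 1), and the message "a must be non-empty" is what one would expect when that range is empty, that is, when lenmin is larger than lenmax. How the two values were set in the script is not shown in the report. A check along the following lines would expose the problem earlier; the helper name and the example numbers are hypothetical.

```python
def check_length_range(lenmin, lenmax):
    """Return the candidate lengths, or raise if the range would be empty."""
    lengths = range(lenmin, lenmax + 1)
    if len(lengths) == 0:
        raise ValueError(
            f'empty length range: lenmin={lenmin} is larger than lenmax={lenmax}'
        )
    return lengths


print(list(check_length_range(7, 10)))  # [7, 8, 9, 10]

try:
    check_length_range(28, 7)  # minimum above maximum
except ValueError as err:
    message = str(err)
    print(message)
```

Printing the minimum and maximum length passed to the library constructor in classify-amp.py would show whether this is what happened.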
