sbrg / ssbio

A Python framework for structural systems biology

Home Page: http://ssbio.readthedocs.io/en/latest/

License: MIT License

Python 93.01% Scheme 0.33% Perl 6.50% Jupyter Notebook 0.15%

systems-biology protein-structure structural-biology constraint-based-modeling cobrapy structural-systems-biology

ssbio's Introduction

ssbio: A Framework for Structural Systems Biology

Introduction

This Python package provides a collection of tools for people with questions in the realm of structural systems biology. The main goals of this package are to:

Provide an easy way to map hundreds or thousands of genes to their encoded protein sequences and structures
Directly link protein structures to genome-scale metabolic models
Demonstrate fully-featured Python scientific analysis environments in Jupyter notebooks

Example questions you can (start to) answer with this package:

How can I determine the number of protein structures available for my list of genes?
What is the best, representative structure for my protein?
Where, in a metabolic network, do these proteins work?
Where do popular mutations show up on a protein?
How can I compare the structural features of entire proteomes?
How do structural properties correlate with my experimental datasets?
How can I improve the contents of my metabolic model with structural data?

Try it without installing

Note

Binder notebooks are still in beta, but they mostly work! Third-party programs are preinstalled in the Binder notebooks, except for I-TASSER and TMHMM, which are excluded due to licensing restrictions.

Installation

First install NGLview using pip, then install ssbio

pip install nglview
jupyter-nbextension enable nglview --py --sys-prefix
pip install ssbio

Updating

pip install ssbio --upgrade

Uninstalling

pip uninstall ssbio

Dependencies

See: Software for a list of external programs to install, along with the functionality that they add. Most of these additional programs are used to predict or calculate properties of proteins, and are only required if you desire to calculate the described properties.

Tutorials

Check out some Jupyter notebook tutorials for a single Protein and/or for many proteins in a GEM-PRO model. See the list of all Tutorials.

Citation

The manuscript for the ssbio package can be found and cited at [1].

[1]	Mih N, Brunk E, Chen K, Catoiu E, Sastry A, Kavvas E, Monk JM, Zhang Z, Palsson BO. 2018. ssbio: A Python Framework for Structural Systems Biology. Bioinformatics. https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/bty077/4850940.

ssbio's Issues

json-tricks - saving a string with two forward slashes

Need to research: saving an object that has an attribute containing two forward slashes, as in a URL (http://), results in a loading error later, something about "control characters".

@EdwardCatoiu

MSMS installation on MacOS

The installation instructions (Unix) for MSMS are not working for me anymore. I have a Mac (macOS Ventura 13.4) with an M1 chip. The link to the MSMS download brings me here, where there is nothing to download.

Option to run QC/QA for all structures

Currently, QC/QA for structures stops when a representative structure is found. However the following case is possible:

Multiple parts of the structure are homology modeled
Another structure has a ligand or something that we are interested in

In these cases we want to have the alignment/residue mapping available for these structures but we might not have them. "set_representative_structure" doesn't sound like the right function to get that info. Also another issue is that the alignment info is stored as the "repchain_index" in the representative sequence itself. There should be a better place to store that info since we would want to map more than one structure to the sequence in these cases.

Use of .ix indexing is deprecated in pandas & no pandas version specified in requirements.

Fail to read P00533

My code:


PROJECT = 'GTspec_query'
LIST_OF_GENES = ['P01106','P01229','P01374']
PDB_FILE_TYPE = 'mmtf'

# Create the GEM-PRO project
my_gempro = GEMPRO(gem_name=PROJECT, root_dir=ROOT_DIR, genes_list=LIST_OF_GENES, pdb_file_type=PDB_FILE_TYPE)

# UniProt mapping
my_gempro.uniprot_mapping_and_metadata(model_gene_source='ACC+ID')

This fails, apparently due to a bad model in P00533. Error below:

<ipython-input-2-d2f5408c2c3e> in <module>()
----> 1 execfile('02_structure_annote.py')

/media/ben/9c17f1c9-a45e-49ec-b547-8fbd2f25ccc6/GTspecificity/02_structure_annote.py in <module>()
    43
    44 # UniProt mapping
---> 45 my_gempro.uniprot_mapping_and_metadata(model_gene_source='ACC+ID')  ## [Nathan] can leave this as ACC+ID which means your list o
    46 print('Missing UniProt mapping: ', my_gempro.missing_uniprot_mapping)
    47 my_gempro.df_uniprot_metadata.head()

/home/ben/anaconda2/lib/python2.7/site-packages/ssbio/pipeline/gempro.pyc in uniprot_mapping_and_metadata(self, model_gene_source, custo
   546                         uniprot_prop = g.protein.load_uniprot(uniprot_id=mapped_uniprot, download=True, outdir=outdir,
   547                                                               set_as_representative=set_as_representative,
--> 548                                                               force_rerun=force_rerun)
   549                     except HTTPError as e:
   550                         log.error('{}, {}: unable to complete web request'.format(g.id, mapped_uniprot))

/home/ben/anaconda2/lib/python2.7/site-packages/ssbio/core/protein.pyc in load_uniprot(self, uniprot_id, uniprot_seq_file, uniprot_xml_f
   378             if download:
   379                 uniprot_prop.download_metadata_file(outdir=outdir, force_rerun=force_rerun)
--> 380                 uniprot_prop.download_seq_file(outdir=outdir, force_rerun=force_rerun)
   381
   382             # Also check if UniProt sequence matches a potentially set representative sequence

/home/ben/anaconda2/lib/python2.7/site-packages/ssbio/databases/uniprot.pyc in download_seq_file(self, outdir, force_rerun)
   183                                                    force_rerun=force_rerun)
   184
--> 185         self.sequence_path = uniprot_fasta_file
   186
   187     def download_metadata_file(self, outdir, force_rerun=False):

/home/ben/anaconda2/lib/python2.7/site-packages/ssbio/protein/sequence/seqprop.pyc in sequence_path(self, fasta_path)

Wanted: structure viewer in Jupyter notebook

It would be nice to view a representative structure in a notebook or map mutations onto it. The idea would be to directly load structures within Jupyter notebooks perhaps using PV viewer or NGL viewer (which already has this kind of support).

Aquaria -- perhaps ssbio can have links to aquaria info pages

See documentation: https://docs.google.com/document/u/1/d/1566Ub_-WAXMcuOA8gmb2-X9JmYfWuZarrgneNVaBQd8/pub

Example: http://aquaria.ws/P51451

mapping pdb resnum to uniprot resnum

Hey nice library. I am interested in this function for reading the SIFTS residue XML files (which are a pain to parse!)

ssbio.databases.pdb.map_uniprot_resnum_to_pdb

Do you have any tips for going the other direction (I.e. map_pdb_resnum_to_uniprot)? Otherwise could this be a useful addition?
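
For the reverse direction, one possible approach is to walk the same SIFTS residue records and match on the PDB cross-reference first. The sketch below is not part of ssbio: the function name `map_pdb_resnum_to_uniprot_sketch` is hypothetical, and the embedded XML is a hand-written miniature that assumes the usual SIFTS layout (`residue` elements holding `crossRefDb` children with `dbSource`, `dbChainId`, and `dbResNum` attributes). The IDs and residue numbers in it are invented.

```python
import xml.etree.ElementTree as ET

# XML namespace assumed for SIFTS residue-level files.
SIFTS_NS = '{http://www.ebi.ac.uk/pdbe/docs/sifts/eFamily.xsd}'

# Hand-written miniature imitating the SIFTS layout; IDs and numbers are made up.
SAMPLE_SIFTS = """<entry xmlns="http://www.ebi.ac.uk/pdbe/docs/sifts/eFamily.xsd">
  <entity type="protein" entityId="A">
    <segment segId="demo_A_1_2">
      <listResidue>
        <residue dbSource="PDBe" dbResNum="1">
          <crossRefDb dbSource="PDB" dbAccessionId="demo" dbResNum="5" dbChainId="A"/>
          <crossRefDb dbSource="UniProt" dbAccessionId="P00000" dbResNum="27"/>
        </residue>
        <residue dbSource="PDBe" dbResNum="2">
          <crossRefDb dbSource="PDB" dbAccessionId="demo" dbResNum="6" dbChainId="A"/>
          <crossRefDb dbSource="UniProt" dbAccessionId="P00000" dbResNum="28"/>
        </residue>
      </listResidue>
    </segment>
  </entity>
</entry>"""


def map_pdb_resnum_to_uniprot_sketch(sifts_xml, chain_id, pdb_resnum):
    """Return (uniprot_accession, uniprot_resnum) for a PDB residue, or None if unmapped."""
    root = ET.fromstring(sifts_xml)
    for residue in root.iter(SIFTS_NS + 'residue'):
        # Index this residue's cross-references by source database
        refs = {ref.get('dbSource'): ref for ref in residue.findall(SIFTS_NS + 'crossRefDb')}
        pdb_ref = refs.get('PDB')
        uniprot_ref = refs.get('UniProt')
        if pdb_ref is None or uniprot_ref is None:
            continue
        # PDB residue numbers can carry insertion codes, so compare as strings
        if pdb_ref.get('dbChainId') == chain_id and pdb_ref.get('dbResNum') == str(pdb_resnum):
            return uniprot_ref.get('dbAccessionId'), int(uniprot_ref.get('dbResNum'))
    return None


result = map_pdb_resnum_to_uniprot_sketch(SAMPLE_SIFTS, chain_id='A', pdb_resnum=6)
print(result)
```

Residues without both cross-references (for example, unobserved residues or expression tags) are skipped, so the function returns None for them.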

Difficulty installing SSBIO package due to dependency conflicts

Description:
I am currently working on a metabolic modelling project and would like to utilize the SSBIO and nglview packages. However, I am encountering installation issues specifically with SSBIO. It appears that SSBIO has numerous dependencies, causing the installation process to fail when using pip. The information provided in the requirement.txt file is insufficient for successful installation attempts. I have experimented with different versions of Python and corresponding pip versions, but none have resolved the problem. The installation process halts with an error message similar to the one below, observed when using Python 3.10.12 on Google Colab:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behavior is the source of the following dependency conflicts.
yfinance 0.2.21 requires beautifulsoup4>=4.11.1, but you have beautifulsoup4 4.5.3 which is incompatible.

I have also attempted installations on various Python versions and systems, but encountered similar error messages with conflicting packages.

Request:
Could you kindly provide more precise information regarding the module version and Python version used for successful installation? Alternatively, is it possible to share a functional Docker image that incorporates the necessary dependencies? This would greatly assist me in resolving the installation challenges and proceeding with my metabolic modelling project.

Intuitive coloring of protein structures

Protein structures in NGLview could be colored by their quality, e.g. red for homology models and silver for experimental models. This gives a quick idea of the quality of the information being presented. Other ideas:

"figure legend" type of thing displaying quality information (c-score, basic PDB quality)
running QMEAN for all structures and reporting quality this way
coloring certain residues based on quality (b-factor)

Issue saving PDB 5LZD

Large structures such as 5LZD run into trouble when trying to save specific chains as PDB files when setting representative structures. Not sure if I can get around that - need to save them as mmCIF somehow.

characterize_residue_mutation will only return properties defined in EXTENDED_AA_PROPERTY_DICT_ONE

In ssbio.protein.sequence.properties.residues.characterize_residue_mutation(...)

for prop, aa_list in EXTENDED_AA_PROPERTY_DICT_ONE.items():

uses "EXTENDED_AA_PROPERTY_DICT_ONE" when it should use "propdict"
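
The intended behavior can be illustrated with a minimal sketch. Everything in it is hypothetical: `characterize_mutation_sketch`, its signature, and the two small property dictionaries are stand-ins, not the real ssbio function or the real EXTENDED_AA_PROPERTY_DICT_ONE. The only point is that the loop iterates over the `propdict` argument rather than the module-level default.

```python
# Stand-in for the module-level default (the real dictionary is larger).
DEFAULT_PROPS = {'Small': ['A', 'G', 'S'], 'Charged': ['D', 'E', 'H', 'K', 'R']}


def characterize_mutation_sketch(wt_aa, mut_aa, propdict=None):
    """Report, for each property, whether the wild-type and mutant residues have it."""
    if propdict is None:
        propdict = DEFAULT_PROPS
    result = {}
    # Iterate over the dictionary that was passed in, not the module-level default
    for prop, aa_list in propdict.items():
        result[prop] = (wt_aa in aa_list, mut_aa in aa_list)
    return result


custom = {'Aromatic': ['F', 'W', 'Y']}
custom_result = characterize_mutation_sketch('F', 'A', propdict=custom)
print(custom_result)
```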

Tab completion issues on Jupyter notebook for a GEMPRO object

There seem to be issues with tab completion on a GEMPRO object. Unfortunately, interrupting the kernel kills the process. Need to hunt down the bug.

cachetools does not work in Python 2?

Some people were getting import errors when I had cachetools.func decorating functions. Not sure if this is a Python 2 issue; check it out.

Import Error - Bio.Alphabet has been removed from Biopython

ImportError: Bio.Alphabet has been removed from Biopython. In many cases, the alphabet can simply be ignored and removed from scripts. In a few cases, you may need to specify the ``molecule_type`` as an annotation on a SeqRecord for your script to work correctly. Please see https://biopython.org/wiki/Alphabet for more information.

The Biopython version should be specified more tightly in requirements.txt

AttributeError: 'module' object has no attribute 'urlretrieve'

I'm getting the following error when running the example notebook "GEM-PRO pipeline - one gene with sequence.ipynb".

AttributeError Traceback (most recent call last)
in ()
----> 1 my_gempro.set_representative_structure()

/home/user/ssbio/ssbio/pipeline/gempro.pyc in set_representative_structure(self, seq_outdir, struct_outdir, pdb_file_type, engine, always_use_homology, seq_ident_cutoff, allow_missing_on_termini, allow_mutants, allow_deletions, allow_insertions, allow_unresolved, force_rerun)
1109 allow_insertions=allow_insertions,
1110 allow_unresolved=allow_unresolved,
-> 1111 force_rerun=force_rerun)
1112
1113 if not repstruct:

/home/user/ssbio/ssbio/core/protein.pyc in set_representative_structure(self, seq_outdir, struct_outdir, pdb_file_type, engine, always_use_homology, seq_ident_cutoff, allow_missing_on_termini, allow_mutants, allow_deletions, allow_insertions, allow_unresolved, force_rerun)
591 # This will add all chains to the mapped_chains attribute if there are none
592 try:
--> 593 pdb.download_structure_file(outdir=struct_outdir, file_type=pdb_file_type, force_rerun=force_rerun)
594 # Download the mmCIF header file to get additional information
595 if 'cif' not in pdb_file_type:

/home/user/ssbio/ssbio/databases/pdb.pyc in download_structure_file(self, outdir, file_type, force_rerun)
44 pdb_file = download_structure(pdb_id=self.id, file_type=file_type, only_header=False,
45 outdir=outdir,
---> 46 force_rerun=force_rerun)
47 log.debug('{}: downloaded {} file'.format(self.id, file_type))
48 self.load_structure_file(pdb_file, file_type)

/home/user/ssbio/ssbio/databases/pdb.pyc in download_structure(pdb_id, file_type, outdir, outfile, only_header, force_rerun)
127
128 if python_version == 2:
--> 129 urllib2.urlretrieve(download_link, outfile)
130 html = urllib2.urlopen(download_link).read()
131 with open(outfile, 'wb') as f:

AttributeError: 'module' object has no attribute 'urlretrieve'

It occurs at the "my_gempro.set_representative_structure()" line.

Revamp logging to disable at higher levels

# Temporarily disable logging messages
logging.disable(logging.WARNING)

<CODE>

# Re-enable logging
logging.disable(logging.NOTSET)
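
One way to package this pattern is a context manager, so that logging is re-enabled even if the wrapped code raises. This is a sketch using only the standard library; `logging_disabled` is a hypothetical helper name, not an ssbio function.

```python
import logging
from contextlib import contextmanager


@contextmanager
def logging_disabled(highest_level=logging.WARNING):
    """Suppress all log records at or below highest_level inside the with-block."""
    logging.disable(highest_level)
    try:
        yield
    finally:
        # Re-enable logging, mirroring logging.disable(logging.NOTSET)
        logging.disable(logging.NOTSET)


log = logging.getLogger('demo')
with logging_disabled():
    suppressed = not log.isEnabledFor(logging.WARNING)
    errors_still_on = log.isEnabledFor(logging.ERROR)
restored = log.isEnabledFor(logging.WARNING)
print(suppressed, errors_still_on, restored)
```

Messages above the chosen level (here, ERROR and CRITICAL) still get through inside the block.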

Importing ssbio on python2.7 cobra environment

Freshly installed cobra environment + ssbio runs into an import error for the package "ruamel_yaml", solved by:

pip install ruamel_yaml==0.11.14

See: https://stackoverflow.com/questions/41373834/conda-importerror-no-module-named-ruamel-yaml-comments

Add runtime estimates for third party programs

A run for a single protein of an external program would be useful to understand how long calculating a property would take.
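
A minimal way to collect such an estimate is to time one invocation of the external program. The sketch below uses only the standard library; `time_command` is a hypothetical helper, and the command shown is a trivial stand-in for a real third-party program.

```python
import subprocess
import sys
import time


def time_command(cmd):
    """Run cmd once and return (return_code, elapsed_seconds)."""
    start = time.perf_counter()
    completed = subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    elapsed = time.perf_counter() - start
    return completed.returncode, elapsed


# Stand-in command: the current Python interpreter doing nothing
returncode, seconds = time_command([sys.executable, '-c', 'pass'])
print('exit code {}, took {:.3f} s'.format(returncode, seconds))
```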

Comprehensive calculation of amino acid composition

Could be integrated with packages such as: https://github.com/Superzchen/iFeature/

lxml etree doesn't work in Python 3.6?

Haven't looked into this yet -- import errors in 3.6 environment for etree.

Sequence to structure pairwise alignment parameter tuning

When running QC/QA, the first thing that is done is a pairwise alignment. The quality of this alignment can have a big impact on the result. The gap score and gap extension penalty settings should be explored to prevent bad alignments (example to be added later).

tmscore issue

ssbio/docs/notebooks/FATCAT - Structure Similarity.ipynb

Line 72 in 6f41652

"fatcat.parse_fatcat(fatcat_outfile)"

When I try to run parse_fatcat I get an error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-4-a327e7af1a57> in <module>()
----> 1 fatcat.parse_fatcat(fatcat_outfile)

/home/ben/anaconda2/lib/python2.7/site-packages/ssbio/protein/structure/properties/fatcat.pyc in parse_fatcat(fatcat_xml)
     96     # Find the tmScore of the alignment
     97     if soup.find('block'):
---> 98         fatcat_results['tm_score'] = float(soup.find('afpchain')['tmscore'])
     99 
    100     return fatcat_results

/home/ben/anaconda2/lib/python2.7/site-packages/bs4/element.pyc in __getitem__(self, key)
    995         """tag[key] returns the value of the 'key' attribute for the tag,
    996         and throws an exception if it's not there."""
--> 997         return self.attrs[key]
    998 
    999     def __iter__(self):

KeyError: 'tmscore'

There seems to be an issue with the tm_score or tmscore key
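
Whether the attribute is missing from the FATCAT output or merely spelled with different casing is not clear from the traceback. The sketch below shows a tolerant lookup that covers both cases. It uses the standard library instead of BeautifulSoup, matches the attribute name case-insensitively, and returns None when it is absent. The XML snippets are made up, and `get_tm_score` is a hypothetical helper, not the ssbio implementation.

```python
import xml.etree.ElementTree as ET


def get_tm_score(fatcat_xml_text):
    """Return the TM-score as a float, or None if no such attribute is present."""
    root = ET.fromstring(fatcat_xml_text)
    for elem in root.iter():
        if elem.tag.lower() != 'afpchain':
            continue
        # Compare attribute names case-insensitively (tmScore, tmscore, ...)
        attrs = {name.lower(): value for name, value in elem.attrib.items()}
        if 'tmscore' in attrs:
            return float(attrs['tmscore'])
    return None


with_score = '<root><AFPChain name1="a" name2="b" tmScore="0.87"><block/></AFPChain></root>'
without_score = '<root><AFPChain name1="a" name2="b"><block/></AFPChain></root>'
print(get_tm_score(with_score), get_tm_score(without_score))
```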

Cleaning mmtf files and saving as pdbs

Currently the most recent release of Biopython (1.68) doesn't parse mmtf files correctly (altlocs and disordered flags are set wrong). The development version does. For now, if a user has 1.68, cleaning structure files doesn't work properly when mmtf is set as the default, so mmcif will be the default file format until the next Biopython release.

See biopython/biopython#975

pdb_file_type does not propagate when loading a GEM

A change in pdb_file_type does not work if loading a model. It needs to be set somewhere.

No module named json_r

I'm getting the following error after a new git pull

Virtual Jupyter notebooks (Everware, Binder) to have external software preinstalled for workflows would be much better for sustainability

Problems when saving taxonomy ID and other potential non existing fields from a PDB

ImportError: cannot import name 'one_to_three' from 'Bio.PDB.Polypeptide'

The newest version from Biopython does not have 'one_to_three' anymore which is why importing ssbio.protein.sequence.properties.residues fails. Please revise when possible.

Pyspark and parquet for sequences analysis to parallelize computations

https://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files

Issues running the tutorial

ssbio/docs/tutorials.ipynb

Line 7 in 6f41652

"# Index: Tutorials\n",

There are two issues I've found running this code:

I think shutil.which is only available for Python 3, but the IPython notebook defaults to opening in Python 2. I switched to using subprocess and it works fine:

def check_path(path):
    """Check if the specified program is in the PATH and can be run in a shell."""
    import subprocess

    try:
        # `which` exits non-zero when nothing is found, which raises CalledProcessError
        checker = subprocess.check_output(['which', path])
    except subprocess.CalledProcessError:
        raise OSError('FAILURE: unable to find {}'.format(path))
    print('SUCCESS: {} found!'.format(path))
    return checker

The fatcat and scratch install instructions do not create short command names for these programs. This is easily resolved with symlinks:

sudo ln -s /<path>/biojava-protein-comparison-tool-4.0.0/runFATCAT.sh /usr/bin/fatcat
sudo ln -s /media/<path>/SCRATCH-1D_1.1/bin/run_SCRATCH-1D_predictors.sh /usr/bin/scratch
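
For the first point, another option that avoids shelling out to `which` is to try `shutil.which` and fall back to `distutils.spawn.find_executable` when it is unavailable. This is a sketch, not the tutorial's code; `check_path_compat` is a hypothetical name.

```python
import sys

try:
    from shutil import which  # available in Python 3.3+
except ImportError:
    from distutils.spawn import find_executable as which  # Python 2 fallback


def check_path_compat(program):
    """Return the full path to program, or raise OSError if it cannot be found."""
    found = which(program)
    if not found:
        raise OSError('FAILURE: unable to find {}'.format(program))
    print('SUCCESS: {} found!'.format(program))
    return found


# The running interpreter is used here only because it is guaranteed to exist
interpreter_path = check_path_compat(sys.executable)
```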

Architectural change todo -- PDBProp objects with the same ID for a GEM-PRO should point to the same object rather than being created under different proteins

This also will make the transition to using the hadoop sequence file for structures much easier.

COBRApy's newest release (0.6.1) moves json attributes into a new module

COBRApy has refactored their json attribute definitions into a new "dict.py" module -- refactor io code to reflect this

https://github.com/opencobra/cobrapy/blob/devel/cobra/io/dict.py

Problematic PDB files

The best_structures API returns chains that are seemingly not present in these files. Find out why get_pdb_seqs doesn't have these chains. These look to be all mmCIF structures.

WARNING:ssbio.pipeline.gempro:5iqr: chain 8 not found in structure!
WARNING:ssbio.pipeline.gempro:5l3p: chain z not found in structure!
WARNING:ssbio.pipeline.gempro:5kpw: chain 33 not found in structure!
WARNING:ssbio.pipeline.gempro:5kpx: chain 33 not found in structure!
WARNING:ssbio.pipeline.gempro:5kpv: chain 33 not found in structure!

Add feature to map sequences to existing database (100% sequence identity only)

Currently, if sequences are provided manually, there is no way to easily map them to KEGG or UniProt other than providing the IDs. It would be nice to be able to pull metadata based on the sequence provided, that way we can map to PDBs easily.
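
Because only 100% identity matters, exact matching can be done with a plain dictionary lookup on the normalized sequence. The sketch below assumes the reference sequences are already available locally as an ID-to-sequence mapping. The IDs and sequences are invented, and `normalize`, `build_sequence_index`, and `find_exact_matches` are hypothetical helpers, not ssbio functions.

```python
from collections import defaultdict


def normalize(seq):
    """Uppercase the sequence and drop whitespace and a trailing stop symbol."""
    return ''.join(seq.split()).upper().rstrip('*')


def build_sequence_index(id_to_seq):
    """Map each normalized sequence to the sorted list of IDs that share it."""
    index = defaultdict(list)
    for seq_id, seq in id_to_seq.items():
        index[normalize(seq)].append(seq_id)
    return {seq: sorted(ids) for seq, ids in index.items()}


def find_exact_matches(query_seq, index):
    """Return IDs whose sequence is identical to the query (empty list if none)."""
    return index.get(normalize(query_seq), [])


reference = {'ID_A': 'MKTAYIAKQR', 'ID_B': 'MKTAYIAKQR', 'ID_C': 'MSLNE'}
index = build_sequence_index(reference)
matches = find_exact_matches('mktayiakqr*', index)
print(matches)
```

The matched IDs could then be passed on to the existing ID-based mapping steps.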

ImportError: cannot import name 'Mapping' from 'collections'

I installed ssbio with Python 3.10 and ipywidgets 7.0.0, and an error occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/wook/.local/lib/python3.10/site-packages/nglview/__init__.py", line 4, in <module>
    from . import adaptor, datafiles, show, widget
  File "/home/wook/.local/lib/python3.10/site-packages/nglview/show.py", line 13, in <module>
    from .widget import NGLWidget
  File "/home/wook/.local/lib/python3.10/site-packages/nglview/widget.py", line 8, in <module>
    import ipywidgets as widgets
  File "/home/wook/.local/lib/python3.10/site-packages/ipywidgets/__init__.py", line 25, in <module>
    from .widgets import *
  File "/home/wook/.local/lib/python3.10/site-packages/ipywidgets/widgets/__init__.py", line 20, in <module>
    from .widget_selection import RadioButtons, ToggleButtons, ToggleButtonsStyle, Dropdown, Select, SelectionSlider, SelectMultiple, SelectionRangeSlider
  File "/home/wook/.local/lib/python3.10/site-packages/ipywidgets/widgets/widget_selection.py", line 9, in <module>
    from collections import Mapping, Iterable
ImportError: cannot import name 'Mapping' from 'collections' (/usr/lib/python3.10/collections/__init__.py)

The error seems to occur because "Mapping" can no longer be imported directly from "collections". The ipywidgets dependency needs to be updated to a newer version.
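
Since Python 3.10, `Mapping` and `Iterable` are importable only from `collections.abc`. A common compatibility pattern is sketched below; it illustrates the general fix and is not a copy of the change made in ipywidgets.

```python
try:
    from collections.abc import Mapping, Iterable  # Python 3.3+
except ImportError:
    from collections import Mapping, Iterable  # Python 2 fallback

print(isinstance({}, Mapping), isinstance([], Iterable))
```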

tmhmm installation and running

Hi, I am trying to predict transmembrane proteins using tmhmm-2.0c. I changed the Perl path as follows.

First I checked the location of perl:
which perl
/usr/bin/perl

Then I checked the version to make sure it is above 5:
perl -v
This is perl 5, version 16, subversion 3 (v5.16.3) built for x86_64-linux-thread-multi (with 44 registered patches, see perl -V for more detail)

I pasted the path (/usr/bin/perl) into the tmhmm and tmhmmformat.pl scripts.
Then I ran the command:
./tmhmm --short < /mnt/genome3/Lab_Users/Kishor/DISK_3/mnspt1/blood/rsem_outdir/blood.Trinity.RSEM.retained.clustered.fasta.transdecoder.pep > tmhmm.out

But I only received a blank folder. Any suggestions for solving the problem would be great. Thanks

Biopython needle alignment does not put quotations around the outfile

Download structure - fixes for python 2

Download structure - fixes for python 2 - from Eddie

try:
    import urllib.request as urllib2
    python_version = 3
except ImportError:
    import urllib2
    python_version = 2

def download_structure(pdb_id, file_type, outdir='', outfile='', only_header=False, force_rerun=False):
    """Download a structure from the RCSB PDB by ID. Specify the file type desired.

    Args:
        pdb_id: PDB ID
        file_type: pdb, pdb.gz, mmcif, cif, cif.gz, xml.gz, mmtf, mmtf.gz
        outdir: Optional output directory
        outfile: Optional output name
        only_header: If only the header file should be downloaded
        force_rerun: If the file should be downloaded again even if it exists

    Returns:
        str: Path to outfile

    """
    pdb_id = pdb_id.lower()
    file_type = file_type.lower()
    file_types = ['pdb', 'pdb.gz', 'mmcif', 'cif', 'cif.gz', 'xml.gz', 'mmtf', 'mmtf.gz']
    if file_type not in file_types:
        raise ValueError('Invalid file type, must be either: pdb, pdb.gz, cif, cif.gz, xml.gz, mmtf, mmtf.gz')

    if file_type.endswith('.gz') or file_type == 'mmtf':
        gzipped = True
    else:
        gzipped = False

    if file_type == 'mmcif':
        file_type = 'cif'

    if only_header:
        folder = 'header'
        if outfile:
            outfile = op.join(outdir, outfile)
        else:
            outfile = op.join(outdir, '{}.header.{}'.format(pdb_id, file_type))
    else:
        folder = 'download'
        if outfile:
            outfile = op.join(outdir, outfile)
        else:
            outfile = op.join(outdir, '{}.{}'.format(pdb_id, file_type))

    if ssbio.utils.force_rerun(flag=force_rerun, outfile=outfile):
        if file_type == 'mmtf.gz' or file_type == 'mmtf':
            mmtf_api = '1.0'
            download_link = 'http://mmtf.rcsb.org/v{}/full/{}.mmtf.gz'.format(mmtf_api, pdb_id)
        else:
            download_link = 'https://files.rcsb.org/{}/{}.{}'.format(folder, pdb_id, file_type)
	
        if python_version == 2:
            # Python 2: urllib2 has no urlretrieve, so read the response and write it out manually
            html = urllib2.urlopen(download_link).read()
            with open(outfile, 'wb') as f:
                f.write(html)
        elif python_version == 3:
            urllib2.urlretrieve(download_link, outfile)
        if gzipped:
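            # str.strip('.gz') removes any leading/trailing '.', 'g', 'z' characters, so trim the suffix explicitly
            gunzipped = outfile[:-3] if outfile.endswith('.gz') else outfile
            outfile = ssbio.utils.gunzip_file(infile=outfile,
                                              outfile=gunzipped,
                                              delete_original=True,
                                              force_rerun_flag=force_rerun)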

        log.debug('{}: Saved structure file'.format(outfile))
    else:
        log.debug('{}: Structure file already saved'.format(outfile))

    return outfile

Problematic PDB files: list of IDs

This issue will be left open for all PDBs that have strange errors...

ID: 5nc5
Problem: Biopython loading of mmcif file fails, says duplicated atom name