sbrg / ssbio

A Python framework for structural systems biology

Home Page: http://ssbio.readthedocs.io/en/latest/

License: MIT License

Python 93.01% Scheme 0.33% Perl 6.50% Jupyter Notebook 0.15%
systems-biology protein-structure structural-biology constraint-based-modeling cobrapy structural-systems-biology

ssbio's Introduction

ssbio: A Framework for Structural Systems Biology

Introduction

This Python package provides a collection of tools for people with questions in the realm of structural systems biology. The main goals of this package are to:

  1. Provide an easy way to map hundreds or thousands of genes to their encoded protein sequences and structures
  2. Directly link protein structures to genome-scale metabolic models
  3. Demonstrate fully-featured Python scientific analysis environments in Jupyter notebooks

Example questions you can (start to) answer with this package:

  • How can I determine the number of protein structures available for my list of genes?
  • What is the best, representative structure for my protein?
  • Where, in a metabolic network, do these proteins work?
  • Where do popular mutations show up on a protein?
  • How can I compare the structural features of entire proteomes?
  • How do structural properties correlate with my experimental datasets?
  • How can I improve the contents of my metabolic model with structural data?

Try it without installing

Note

Binder notebooks are still in beta, but they mostly work! Third-party programs are also preinstalled in the Binder notebooks, except for I-TASSER and TMHMM, which are excluded due to licensing restrictions.

Installation

First install NGLview using pip, then install ssbio:

pip install nglview
jupyter-nbextension enable nglview --py --sys-prefix
pip install ssbio

Updating

pip install ssbio --upgrade

Uninstalling

pip uninstall ssbio

Dependencies

See: Software for a list of external programs to install, along with the functionality that they add. Most of these additional programs are used to predict or calculate properties of proteins, and are only required if you desire to calculate the described properties.

Tutorials

Check out some Jupyter notebook tutorials for a single Protein or for many in a GEM-PRO model. See a list of all Tutorials.

Citation

The manuscript for the ssbio package can be found and cited at [1].

[1] Mih N, Brunk E, Chen K, Catoiu E, Sastry A, Kavvas E, Monk JM, Zhang Z, Palsson BO. 2018. ssbio: A Python Framework for Structural Systems Biology. Bioinformatics. https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/bty077/4850940.

ssbio's People

Contributors

nmih

ssbio's Issues

Option to run QC/QA for all structures

Currently, QC/QA for structures stops when a representative structure is found. However, the following cases are possible:

  • Multiple parts of the structure are homology modeled
  • Another structure has a ligand or something that we are interested in

In these cases we want the alignment/residue mapping to be available for these structures, but we might not have it. "set_representative_structure" doesn't sound like the right function to get that info. Another issue is that the alignment info is stored as the "repchain_index" in the representative sequence itself. There should be a better place to store that info, since we would want to map more than one structure to the sequence in these cases.

Fail to read P00533

My code:


from ssbio.pipeline.gempro import GEMPRO

PROJECT = 'GTspec_query'
LIST_OF_GENES = ['P01106','P01229','P01374']
PDB_FILE_TYPE = 'mmtf'
# ROOT_DIR (the project root directory) is defined elsewhere in the original script

# Create the GEM-PRO project
my_gempro = GEMPRO(gem_name=PROJECT, root_dir=ROOT_DIR, genes_list=LIST_OF_GENES, pdb_file_type=PDB_FILE_TYPE)

# UniProt mapping
my_gempro.uniprot_mapping_and_metadata(model_gene_source='ACC+ID')

This fails, and it appears to be due to a bad model in P00533. The error is below:

<ipython-input-2-d2f5408c2c3e> in <module>()
----> 1 execfile('02_structure_annote.py')

/media/ben/9c17f1c9-a45e-49ec-b547-8fbd2f25ccc6/GTspecificity/02_structure_annote.py in <module>()
    43
    44 # UniProt mapping
---> 45 my_gempro.uniprot_mapping_and_metadata(model_gene_source='ACC+ID')  ## [Nathan] can leave this as ACC+ID which means your list o
    46 print('Missing UniProt mapping: ', my_gempro.missing_uniprot_mapping)
    47 my_gempro.df_uniprot_metadata.head()

/home/ben/anaconda2/lib/python2.7/site-packages/ssbio/pipeline/gempro.pyc in uniprot_mapping_and_metadata(self, model_gene_source, custo
   546                         uniprot_prop = g.protein.load_uniprot(uniprot_id=mapped_uniprot, download=True, outdir=outdir,
   547                                                               set_as_representative=set_as_representative,
--> 548                                                               force_rerun=force_rerun)
   549                     except HTTPError as e:
   550                         log.error('{}, {}: unable to complete web request'.format(g.id, mapped_uniprot))

/home/ben/anaconda2/lib/python2.7/site-packages/ssbio/core/protein.pyc in load_uniprot(self, uniprot_id, uniprot_seq_file, uniprot_xml_f
   378             if download:
   379                 uniprot_prop.download_metadata_file(outdir=outdir, force_rerun=force_rerun)
--> 380                 uniprot_prop.download_seq_file(outdir=outdir, force_rerun=force_rerun)
   381
   382             # Also check if UniProt sequence matches a potentially set representative sequence

/home/ben/anaconda2/lib/python2.7/site-packages/ssbio/databases/uniprot.pyc in download_seq_file(self, outdir, force_rerun)
   183                                                    force_rerun=force_rerun)
   184
--> 185         self.sequence_path = uniprot_fasta_file
   186
   187     def download_metadata_file(self, outdir, force_rerun=False):

/home/ben/anaconda2/lib/python2.7/site-packages/ssbio/protein/sequence/seqprop.pyc in sequence_path(self, fasta_path)

Wanted: structure viewer in Jupyter notebook

It would be nice to view a representative structure in a notebook or map mutations onto it. The idea would be to load structures directly within Jupyter notebooks, perhaps using PV viewer or NGL viewer (which already has this kind of support).

mapping pdb resnum to uniprot resnum

Hey, nice library. I am interested in this function for reading the SIFTS residue XML files (which are a pain to parse!):

ssbio.databases.pdb.map_uniprot_resnum_to_pdb

Do you have any tips for going in the other direction (i.e. map_pdb_resnum_to_uniprot)? Otherwise, could this be a useful addition?
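One possible approach to the reverse direction is sketched below. It assumes the forward mapping is already available as a plain dictionary from UniProt residue number to PDB residue number; the actual return type of map_uniprot_resnum_to_pdb is not shown in this issue, and invert_resnum_mapping is a hypothetical helper, not part of ssbio.

```python
def invert_resnum_mapping(uniprot_to_pdb):
    """Invert a {uniprot_resnum: pdb_resnum} dict into {pdb_resnum: uniprot_resnum}.

    Residues without a PDB counterpart (value None) are skipped. A ValueError is
    raised if two UniProt residues map to the same PDB residue, because the
    inverse would then be ambiguous.
    """
    pdb_to_uniprot = {}
    for uniprot_resnum, pdb_resnum in uniprot_to_pdb.items():
        if pdb_resnum is None:
            continue
        if pdb_resnum in pdb_to_uniprot:
            raise ValueError(
                'PDB residue {} maps to more than one UniProt residue'.format(pdb_resnum))
        pdb_to_uniprot[pdb_resnum] = uniprot_resnum
    return pdb_to_uniprot


# Made-up residue numbers, for illustration only
forward = {10: 1, 11: 2, 12: None, 13: 4}
reverse = invert_resnum_mapping(forward)
```

Unresolved residues are dropped rather than kept as None keys, so a lookup for a PDB residue either succeeds or raises KeyError.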

Difficulty installing SSBIO package due to dependency conflicts

Description:
I am currently working on a metabolic modelling project and would like to use the SSBIO and nglview packages. However, I am encountering installation issues, specifically with SSBIO. SSBIO appears to have numerous dependencies, which causes the installation to fail when using pip. The information provided in the requirements.txt file is insufficient for a successful installation. I have experimented with different versions of Python and corresponding pip versions, but none have resolved the problem. The installation halts with an error message similar to the one below, observed when using Python 3.10.12 on Google Colab:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behavior is the source of the following dependency conflicts.
yfinance 0.2.21 requires beautifulsoup4>=4.11.1, but you have beautifulsoup4 4.5.3 which is incompatible.

I have also attempted installations on various Python versions and systems, but encountered similar error messages with conflicting packages.

Request:
Could you kindly provide more precise information regarding the module version and Python version used for successful installation? Alternatively, is it possible to share a functional Docker image that incorporates the necessary dependencies? This would greatly assist me in resolving the installation challenges and proceeding with my metabolic modelling project.

Intuitive coloring of protein structures

Protein structures in NGLview could be colored by their quality, e.g. red for homology models and silver for experimental models. This gives a quick idea of the quality of the information being presented. Other ideas:

  • "figure legend" type of thing displaying quality information (c-score, basic PDB quality)
  • running QMEAN for all structures and reporting quality this way
  • coloring certain residues based on quality (b-factor)
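The basic coloring idea could be sketched as a simple lookup. The names QUALITY_COLORS and structure_color are hypothetical, the category labels are assumptions, and only the red/silver pairing comes from the suggestion above; passing the color on to NGLview is left out.

```python
# Hypothetical color scheme: red for homology models, silver for experimental ones
QUALITY_COLORS = {
    'homology': 'red',
    'experimental': 'silver',
}


def structure_color(structure_kind, default='white'):
    """Return the display color for a structure category.

    Unknown categories fall back to `default` so the viewer always gets a color.
    """
    return QUALITY_COLORS.get(structure_kind, default)
```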

Issue saving PDB 5LZD

Large structures such as 5LZD run into trouble when trying to save specific chains as PDB files when setting representative structures. Not sure if I can get around that - need to save them as mmCIF somehow.
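A minimal sketch of how the choice between PDB and mmCIF output might be made, assuming the trouble comes from the fixed-width limits of the legacy PDB format (the issue does not state the exact cause). The legacy format reserves 5 columns for the atom serial number and 1 column for the chain identifier; the function names and the example counts are hypothetical.

```python
MAX_PDB_ATOM_SERIAL = 99999   # atom serial number field is 5 columns wide
MAX_PDB_CHAIN_ID_LENGTH = 1   # chain identifier field is 1 column wide


def fits_legacy_pdb_format(n_atoms, chain_ids):
    """Return True if a selection can be written as a legacy PDB file."""
    if n_atoms > MAX_PDB_ATOM_SERIAL:
        return False
    return all(len(chain_id) <= MAX_PDB_CHAIN_ID_LENGTH for chain_id in chain_ids)


def choose_output_format(n_atoms, chain_ids):
    """Pick 'pdb' when the selection fits the legacy format, else 'mmcif'."""
    return 'pdb' if fits_legacy_pdb_format(n_atoms, chain_ids) else 'mmcif'
```

With this check, a selection with a multi-character chain ID such as '33' would be routed to mmCIF even when the atom count is small.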

Import Error - Bio.Alphabet has been removed from Biopython

ImportError: Bio.Alphabet has been removed from Biopython. In many cases, the alphabet can simply be ignored and removed from scripts. In a few cases, you may need to specify the ``molecule_type`` as an annotation on a SeqRecord for your script to work correctly. Please see https://biopython.org/wiki/Alphabet for more information.

The Biopython version should be specified more tightly in requirements.txt.
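A small sketch of the version check this implies, assuming the cut-off is Biopython 1.78, the release in which Bio.Alphabet was removed. The helper has_bio_alphabet is hypothetical; a pin such as biopython<1.78 in requirements.txt would be one way to express the same constraint.

```python
def has_bio_alphabet(biopython_version):
    """Return True if this Biopython version string predates the removal of Bio.Alphabet.

    Only the first two dot-separated fields are compared, so a string such as
    '1.79.dev0' is treated as 1.79.
    """
    major, minor = (int(part) for part in biopython_version.split('.')[:2])
    return (major, minor) < (1, 78)
```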

AttributeError: 'module' object has no attribute 'urlretrieve'

I'm getting the following error when running the example notebook "GEM-PRO pipeline - one gene with sequence.ipynb".


AttributeError Traceback (most recent call last)
in ()
----> 1 my_gempro.set_representative_structure()

/home/user/ssbio/ssbio/pipeline/gempro.pyc in set_representative_structure(self, seq_outdir, struct_outdir, pdb_file_type, engine, always_use_homology, seq_ident_cutoff, allow_missing_on_termini, allow_mutants, allow_deletions, allow_insertions, allow_unresolved, force_rerun)
1109 allow_insertions=allow_insertions,
1110 allow_unresolved=allow_unresolved,
-> 1111 force_rerun=force_rerun)
1112
1113 if not repstruct:

/home/user/ssbio/ssbio/core/protein.pyc in set_representative_structure(self, seq_outdir, struct_outdir, pdb_file_type, engine, always_use_homology, seq_ident_cutoff, allow_missing_on_termini, allow_mutants, allow_deletions, allow_insertions, allow_unresolved, force_rerun)
591 # This will add all chains to the mapped_chains attribute if there are none
592 try:
--> 593 pdb.download_structure_file(outdir=struct_outdir, file_type=pdb_file_type, force_rerun=force_rerun)
594 # Download the mmCIF header file to get additional information
595 if 'cif' not in pdb_file_type:

/home/user/ssbio/ssbio/databases/pdb.pyc in download_structure_file(self, outdir, file_type, force_rerun)
44 pdb_file = download_structure(pdb_id=self.id, file_type=file_type, only_header=False,
45 outdir=outdir,
---> 46 force_rerun=force_rerun)
47 log.debug('{}: downloaded {} file'.format(self.id, file_type))
48 self.load_structure_file(pdb_file, file_type)

/home/user/ssbio/ssbio/databases/pdb.pyc in download_structure(pdb_id, file_type, outdir, outfile, only_header, force_rerun)
127
128 if python_version == 2:
--> 129 urllib2.urlretrieve(download_link, outfile)
130 html = urllib2.urlopen(download_link).read()
131 with open(outfile, 'wb') as f:

AttributeError: 'module' object has no attribute 'urlretrieve'

It occurs at the "my_gempro.set_representative_structure()" line.
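The traceback points at urllib2.urlretrieve under Python 2. In Python 2, urlretrieve lives in urllib rather than urllib2, while in Python 3 it lives in urllib.request. A minimal compatibility sketch follows; the wrapper name fetch is hypothetical and this is not the fix that was applied in ssbio.

```python
try:
    # Python 3: urlretrieve lives in urllib.request
    from urllib.request import urlretrieve
except ImportError:
    # Python 2: urlretrieve lives in urllib, not urllib2
    from urllib import urlretrieve


def fetch(download_link, outfile):
    """Download download_link to outfile and return the path that was written."""
    path, _headers = urlretrieve(download_link, outfile)
    return path
```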

Sequence to structure pairwise alignment parameter tuning

When running QC/QA, the first thing that is done is a pairwise alignment. The quality of this alignment can have a big impact on the result. The settings for the gap open and gap extend penalties should be explored to prevent bad alignments (example to be added later).
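To illustrate why these penalties matter, here is a small sketch that scores an already-gapped alignment under affine gap penalties. It is not the aligner ssbio uses; score_alignment and all default values are hypothetical, and the convention assumed is that the first position of a gap run costs gap_open and each further position costs gap_extend.

```python
def score_alignment(aligned_a, aligned_b, match=1, mismatch=-1,
                    gap_open=-10, gap_extend=-0.5):
    """Score two equal-length gapped strings ('-' marks a gap) with affine gap penalties."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError('Aligned sequences must have the same length')

    score = 0
    gap_in_a = gap_in_b = False  # whether the previous column was a gap in a / in b
    for x, y in zip(aligned_a, aligned_b):
        if x == '-' or y == '-':
            in_a = (x == '-')
            continuing = gap_in_a if in_a else gap_in_b
            score += gap_extend if continuing else gap_open
            gap_in_a, gap_in_b = in_a, not in_a
        else:
            score += match if x == y else mismatch
            gap_in_a = gap_in_b = False
    return score


# One gap run of length 2 versus two separate gap runs of length 1
one_long_gap = score_alignment('ACGTAC', 'AC--AC')
two_short_gaps = score_alignment('ACGTAC', 'A-C-AC')
```

With a large gap open penalty, the alignment with a single longer gap scores much better than the one with two short gaps; lowering gap_open narrows that difference.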

tmscore issue

"fatcat.parse_fatcat(fatcat_outfile)"

When I try to run parse_fatcat I get an error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-4-a327e7af1a57> in <module>()
----> 1 fatcat.parse_fatcat(fatcat_outfile)

/home/ben/anaconda2/lib/python2.7/site-packages/ssbio/protein/structure/properties/fatcat.pyc in parse_fatcat(fatcat_xml)
     96     # Find the tmScore of the alignment
     97     if soup.find('block'):
---> 98         fatcat_results['tm_score'] = float(soup.find('afpchain')['tmscore'])
     99 
    100     return fatcat_results

/home/ben/anaconda2/lib/python2.7/site-packages/bs4/element.pyc in __getitem__(self, key)
    995         """tag[key] returns the value of the 'key' attribute for the tag,
    996         and throws an exception if it's not there."""
--> 997         return self.attrs[key]
    998 
    999     def __iter__(self):

KeyError: 'tmscore'

There seems to be an issue with the tm_score or tmscore key.
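The KeyError comes from indexing an attribute that is absent from the element. A defensive lookup could look like the sketch below. It uses the standard-library ElementTree instead of the BeautifulSoup parser shown in the traceback, the element and attribute names are taken from the traceback as assumptions, and the XML strings are made-up minimal inputs rather than real FATCAT output. With BeautifulSoup, tag.get('tmscore') similarly returns None instead of raising.

```python
import xml.etree.ElementTree as ET


def get_tm_score(xml_text, element='afpchain', attribute='tmscore'):
    """Return the attribute as a float, or None if the element or attribute is missing."""
    root = ET.fromstring(xml_text)
    node = root if root.tag == element else root.find('.//' + element)
    if node is None:
        return None
    value = node.get(attribute)  # .get returns None instead of raising KeyError
    return float(value) if value is not None else None


# Made-up minimal inputs, for illustration only
with_score = '<results><afpchain tmscore="0.85"><block/></afpchain></results>'
without_score = '<results><afpchain><block/></afpchain></results>'
```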

Cleaning mmtf files and saving as pdbs

Currently the most recent release of Biopython (1.68) doesn't parse mmtf files correctly (altlocs and disordered flags are set wrong). The development version does. However, if a user has 1.68, cleaning structure files doesn't work properly when mmtf is set as the default. So for now, mmcif will be set as the default file format until the next Biopython release.

See biopython/biopython#975

Issues running the tutorial

"# Index: Tutorials\n",

There are two issues I've found running this code:

  1. I think shutil.which is only available in Python 3, but the IPython notebook defaults to opening in Python 2. I switched to using subprocess and it works fine:
def check_path(path):
    """Check if the specified program is in the PATH and can be run in a shell."""
    import subprocess

    try:
        # `which` exits non-zero when the program is missing,
        # which makes check_output raise CalledProcessError
        checker = subprocess.check_output(['which', path])
    except subprocess.CalledProcessError:
        raise OSError('FAILURE: unable to find {}'.format(path))
    print('SUCCESS: {} found!'.format(path))
    return checker
  2. The fatcat and scratch install instructions do not create short command names for these programs. This is easily resolved with symlinks:
sudo ln -s /<path>/biojava-protein-comparison-tool-4.0.0/runFATCAT.sh /usr/bin/fatcat
sudo ln -s /media/<path>/SCRATCH-1D_1.1/bin/run_SCRATCH-1D_predictors.sh /usr/bin/scratch

Problematic PDB files

The best_structures API returns chains that are seemingly not present in these files. Find out why get_pdb_seqs doesn't have these chains. These look to be all mmCIF structures.

WARNING:ssbio.pipeline.gempro:5iqr: chain 8 not found in structure!
WARNING:ssbio.pipeline.gempro:5l3p: chain z not found in structure!
WARNING:ssbio.pipeline.gempro:5kpw: chain 33 not found in structure!
WARNING:ssbio.pipeline.gempro:5kpx: chain 33 not found in structure!
WARNING:ssbio.pipeline.gempro:5kpv: chain 33 not found in structure!

ImportError: cannot import name 'Mapping' from 'collections'

I installed ssbio with Python 3.10 and ipywidgets 7.0.0, and an error occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/wook/.local/lib/python3.10/site-packages/nglview/__init__.py", line 4, in <module>
    from . import adaptor, datafiles, show, widget
  File "/home/wook/.local/lib/python3.10/site-packages/nglview/show.py", line 13, in <module>
    from .widget import NGLWidget
  File "/home/wook/.local/lib/python3.10/site-packages/nglview/widget.py", line 8, in <module>
    import ipywidgets as widgets
  File "/home/wook/.local/lib/python3.10/site-packages/ipywidgets/__init__.py", line 25, in <module>
    from .widgets import *
  File "/home/wook/.local/lib/python3.10/site-packages/ipywidgets/widgets/__init__.py", line 20, in <module>
    from .widget_selection import RadioButtons, ToggleButtons, ToggleButtonsStyle, Dropdown, Select, SelectionSlider, SelectMultiple, SelectionRangeSlider
  File "/home/wook/.local/lib/python3.10/site-packages/ipywidgets/widgets/widget_selection.py", line 9, in <module>
    from collections import Mapping, Iterable
ImportError: cannot import name 'Mapping' from 'collections' (/usr/lib/python3.10/collections/__init__.py)

The error seems to be caused by Python 3.10 no longer exposing "Mapping" directly from "collections"; it must be imported from "collections.abc" instead. The ipywidgets dependency needs to be updated to a newer version.
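For code that has to run on both old and new interpreters, the usual pattern is to try collections.abc first. This is a generic sketch, not a patch to ipywidgets; is_mapping is a hypothetical helper.

```python
try:
    # Python 3.3+: the abstract base classes live in collections.abc
    from collections.abc import Mapping, Iterable
except ImportError:
    # Python 2: they are only available from collections
    from collections import Mapping, Iterable


def is_mapping(obj):
    """Return True for dict-like objects."""
    return isinstance(obj, Mapping)
```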

tmhmm installation and running

Hi, I am trying to predict transmembrane proteins using tmhmm-2.0c. I have changed the path of perl as follows.

First I checked the location of perl:
which perl
/usr/bin/perl

Then I checked the version to make sure it is above 5:
perl -v
This is perl 5, version 16, subversion 3 (v5.16.3) built for x86_64-linux-thread-multi (with 44 registered patches, see perl -V for more detail)

Copyright 1987-2012, Larry Wall

I pasted the path (/usr/bin/perl) into the tmhmm and tmhmmformat.pl scripts, then ran the command:
./tmhmm --short < /mnt/genome3/Lab_Users/Kishor/DISK_3/mnspt1/blood/rsem_outdir/blood.Trinity.RSEM.retained.clustered.fasta.transdecoder.pep > tmhmm.out

But I only received a blank folder. Any suggestions for solving the problem would be appreciated. Thanks

Download structure - fixes for python 2

Download structure - fixes for python 2 - from Eddie

import logging
import os.path as op

import ssbio.utils

log = logging.getLogger(__name__)

try:
    # Python 3: urlopen and urlretrieve both live in urllib.request
    import urllib.request as urllib2
    python_version = 3
except ImportError:
    # Python 2: urllib2 provides urlopen but not urlretrieve
    import urllib2
    python_version = 2


def download_structure(pdb_id, file_type, outdir='', outfile='', only_header=False, force_rerun=False):
    """Download a structure from the RCSB PDB by ID. Specify the file type desired.

    Args:
        pdb_id: PDB ID
        file_type: pdb, pdb.gz, mmcif, cif, cif.gz, xml.gz, mmtf, mmtf.gz
        outdir: Optional output directory
        outfile: Optional output name
        only_header: If only the header file should be downloaded
        force_rerun: If the file should be downloaded again even if it exists

    Returns:
        str: Path to outfile

    """
    pdb_id = pdb_id.lower()
    file_type = file_type.lower()
    file_types = ['pdb', 'pdb.gz', 'mmcif', 'cif', 'cif.gz', 'xml.gz', 'mmtf', 'mmtf.gz']
    if file_type not in file_types:
        raise ValueError('Invalid file type, must be one of: {}'.format(', '.join(file_types)))

    # mmtf files are downloaded from a .mmtf.gz link, so they are always gzipped
    gzipped = file_type.endswith('.gz') or file_type == 'mmtf'

    if file_type == 'mmcif':
        file_type = 'cif'

    if only_header:
        folder = 'header'
        if outfile:
            outfile = op.join(outdir, outfile)
        else:
            outfile = op.join(outdir, '{}.header.{}'.format(pdb_id, file_type))
    else:
        folder = 'download'
        if outfile:
            outfile = op.join(outdir, outfile)
        else:
            outfile = op.join(outdir, '{}.{}'.format(pdb_id, file_type))

    if ssbio.utils.force_rerun(flag=force_rerun, outfile=outfile):
        if file_type == 'mmtf.gz' or file_type == 'mmtf':
            mmtf_api = '1.0'
            download_link = 'http://mmtf.rcsb.org/v{}/full/{}.mmtf.gz'.format(mmtf_api, pdb_id)
        else:
            download_link = 'https://files.rcsb.org/{}/{}.{}'.format(folder, pdb_id, file_type)

        if python_version == 2:
            # urllib2 has no urlretrieve in Python 2, so read and write the file manually
            html = urllib2.urlopen(download_link).read()
            with open(outfile, 'wb') as f:
                f.write(html)
        elif python_version == 3:
            urllib2.urlretrieve(download_link, outfile)

        if gzipped:
            # str.strip('.gz') removes characters, not a suffix, so drop the suffix explicitly
            unzipped = outfile[:-len('.gz')] if outfile.endswith('.gz') else outfile
            outfile = ssbio.utils.gunzip_file(infile=outfile,
                                              outfile=unzipped,
                                              delete_original=True,
                                              force_rerun_flag=force_rerun)

        log.debug('{}: Saved structure file'.format(outfile))
    else:
        log.debug('{}: Structure file already saved'.format(outfile))

    return outfile

Problematic PDB files: list of IDs

This issue will be left open for all PDBs that have strange errors...

ID: 5nc5
Problem: Biopython loading of mmcif file fails, says duplicated atom name
