The acpsicov from akashnk

Protein Sparse Inverse COVariance analysis program for nucleotide and amino acid sequences
------------------------------------------------------------------------------------------

Installation:
------------
1. Run make to compile the psicov21*.c sources.
2. Edit aacodonPsicov.py so that it points to the paths of the two compiled executables.
  
  
This code was used in the following paper:

Codon-level information improves predictions of inter-residue contacts in proteins by correlated mutation analysis

Etai Jacob, Ron Unger, Amnon Horovitz
Bar-Ilan University, Israel; Weizmann Institute of Science, Israel
DOI: http://dx.doi.org/10.7554/eLife.08932
Published September 15, 2015
Cite as eLife 2015;4:e08932
See also: http://elifesciences.org/content/4/e08932


aacodonPsicov.py:
----------------
Wrapper script for the application: Protein Sparse Inverse COVariance analysis program for nucleotide and amino acid sequences
By Etai Jacob, September 2015 - Copyright (C) 2015 Bar-Ilan University and Weizmann Institute of Science
This code is licensed under the terms of GNU General Public License v2 or later

psicov21codon.c:
---------------
PSICOVcodon - Protein Sparse Inverse COVariance analysis program for nucleotide sequences
   This code is a modified version of PSICOV for amino acid sequences. See details below.
   The following modifications were implemented:
   1) Analysis of codon alphabet
   2) 'codonnum' function

   No optimizations were made to the code.
   
psicov21.c:
----------
From the original PSICOV code by David T. Jones:

USAGE NOTES FOR PSICOV V2.1b3
(By David T. Jones August 2011 - Copyright (C) 2011 University College London)

(See LICENSE file for details on licensing terms)


This release of PSICOV implements a number of optimizations and speedups over the original code. Firstly, a
multithreaded C implementation of the glassofast code has been added. Secondly, the sequence weighting steps
are now also multithreaded. The multithreading will produce small rounding discrepancies compared to the
non-threaded code, but the differences should not be significant.


Compiling
=========

Either use the supplied makefile or compile manually:

gcc -O3 -march=native -ffast-math -m64 -ftree-vectorize -fopenmp psicov21.c -lm -o psicov

Note, compilation requires GCC 4.4 or later - in some Linux distributions, gcc 4.4 can be installed
as "gcc44". If you don't want to use multithreading to speed up calculations, then simply omit the
"-fopenmp" option or use the runtime option "-z 1" to set the maximum number of threads to 1.

If you prefer to use the Intel compiler suite, then use the following compile line:

icc -fast -fopenmp psicov21.c -lm -o psicov


Typical usage
=============

With the speedups implemented in version 2, you may as well use the slower but slightly better option below.
You will probably only need to select the faster options for very large proteins.

***********************************************************************************************************
psicov -p -d 0.03 demo.aln > output

(Run with target precision matrix density of 3% rather than fixed rho and estimate PPVs - takes 2-3x longer
than using a fixed regularization parameter, but this is now the recommended default for Version 2)
***********************************************************************************************************
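If you drive PSICOV from a script, the recommended invocation above can be assembled as follows. This is only a sketch: the function names build_psicov_command and run_psicov are hypothetical, the executable is assumed to be on your PATH as "psicov", and only flags described in this README (-p, -d, -z) are used.

```python
import subprocess


def build_psicov_command(aln_path, density=0.03, threads=None, executable="psicov"):
    """Return the argument list for the recommended Version 2 invocation.

    The helper name and its defaults are illustrative, not part of PSICOV.
    """
    cmd = [executable]
    if threads is not None:
        # -z sets the maximum number of threads
        cmd += ["-z", str(threads)]
    # -p with -d: target precision matrix density instead of a fixed rho
    cmd += ["-p", "-d", str(density), aln_path]
    return cmd


def run_psicov(aln_path, out_path, **kwargs):
    """Run psicov and redirect its standard output to out_path."""
    with open(out_path, "w") as out:
        subprocess.run(build_psicov_command(aln_path, **kwargs), stdout=out, check=True)
```

For example, run_psicov("demo.aln", "output") corresponds to the command line shown above, and passing threads=1 adds "-z 1".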

Some other command line examples are shown below:

psicov -p -r 0.001 demo.aln > output

(Run with fixed default rho and estimate precision - this was the sensible default for most cases)


psicov -z 4 -p -r 0.001 -j 12 demo.aln > output

(Only output contacts separated by 12 or more in the sequence and run with at most 4 threads)


psicov -p -r 0.001 -g 0.3 demo.aln > output

(Ignore alignment columns with > 30% gaps)
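To see in advance which columns a 30% gap limit would affect, you can compute the gap fraction per column yourself. The sketch below is an illustration of the idea, not PSICOV's own code; it assumes gaps are written as '-' and reports 0-based column indices, and the function names are hypothetical.

```python
def gap_fractions(sequences, gap_char="-"):
    """Return the fraction of gap characters in each alignment column."""
    if not sequences:
        return []
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all aligned sequences must have the same length")
    n_seqs = len(sequences)
    return [
        sum(seq[col] == gap_char for seq in sequences) / n_seqs
        for col in range(length)
    ]


def columns_over_gap_limit(sequences, max_gap_fraction=0.3):
    """Return 0-based indices of columns with more than max_gap_fraction gaps."""
    return [
        col
        for col, frac in enumerate(gap_fractions(sequences))
        if frac > max_gap_fraction
    ]
```

With the four rows "AC-E", "A--E", "ACDE" and "AC-E", only the third column (index 2, 75% gaps) exceeds the limit.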


Recommended options for quicker results (quick enough for large proteins):

psicov -p -r 0.001 demo.aln > output

Recommended (but not highly recommended) option for fastest possible approximate results
(quick enough for very large proteins/large alignments):

psicov -a -r 0.001 -i 62 demo.aln > output



Example using HHBLITS
=====================

Here is a simple 3-step example of predicting contacts for a target sequence using HHBLITS as the alignment
method (using the UniRef20 database). HHBLITS can be obtained from ftp://toolkit.genzentrum.lmu.de/pub/HH-suite
See HHBLITS documentation for installation instructions.

1. Search UniRef20 database and produce alignment in A3M format:
hhblits -i example.fasta -d uniprot20_2012_03 -oa3m example.a3m -n 3 -diff inf -cov 60

2. Quick command line conversion from A3M to PSICOV alignment format (and remove identical rows):
egrep -v "^>" example.a3m | sed 's/[a-z]//g' | sort -u > example.aln

3. Run PSICOV as previously described:
psicov -p -d 0.03 example.aln > example.psicov
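
If egrep/sed/sort are not available, step 2 can be approximated in Python. This is a minimal sketch that mirrors the shell pipeline: it drops header lines, deletes lowercase letters (insertion states in A3M) and keeps one copy of each row in sorted order. The function name a3m_to_aln is hypothetical. Like the pipeline, it assumes one sequence per line, and unlike the pipeline it skips empty lines.

```python
def a3m_to_aln(a3m_lines):
    """Convert A3M lines to PSICOV alignment rows (one aligned sequence per row)."""
    rows = set()
    for line in a3m_lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith(">"):
            continue  # skip blank lines and FASTA-style headers
        # Remove lowercase letters, as sed 's/[a-z]//g' does
        rows.add("".join(ch for ch in line if not ("a" <= ch <= "z")))
    # sorted() orders by code point; sort -u may order differently under some locales
    return sorted(rows)
```

Typical use would be to read example.a3m, pass its lines to a3m_to_aln, and write the returned rows to example.aln, one per line. Note that sorting, here as in the shell version, does not preserve the original row order.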



IMPORTANT NOTES
===============

Please note that a large number of _diverse_ homologous sequences are needed for contact prediction to succeed. In
the original paper, the worst example had 511 homologous sequences and the best had 74836. If you try to calculate
contacts with very few sequences, then at best the results will be poor, or at worst psicov will struggle to
converge on a solution and keep running for ages! In the latter case, convergence can be achieved by increasing the
rho parameter to, say, 0.005 or 0.01 (the default is 0.001), e.g.:

psicov -r 0.005 demo.aln > output

However, even if PSICOV does converge this will not necessarily produce good contact predictions - you cannot
avoid the issue of needing a lot of sequence data to properly compute the initial covariance matrix.

PSICOV will give an error if an alignment with too few sequences is used, and will also print a warning if there is
insufficient sequence variation. You can override these checks by using the -o flag or by modifying the code, but bear
in mind that PSICOV can only work well with VERY LARGE sequence families. If you only have a handful of sequences, or
if the sequences are all very similar, then you simply do not have enough data to analyse. For best results you need
a sequence alignment with >1000 sequences and a large amount of sequence variation across the sequences.

Also note that for long sequences, PSICOV will require a large amount of RAM (and a 64-bit system) to handle the
covariance matrices. For example, a target sequence of 300 residues will require nearly 3 GB of RAM. The number of
sequences in the alignment, on the other hand, has relatively little effect on memory usage.
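
The scaling with sequence length can be illustrated with some rough arithmetic. The sketch below assumes 21 states per column (20 amino acids plus gap) and 8 bytes per matrix entry; neither number is taken from this README, so treat the result as an order-of-magnitude guide rather than a statement about PSICOV's actual allocations. The function name is hypothetical.

```python
def covariance_matrix_bytes(n_residues, n_states=21, bytes_per_value=8):
    """Size in bytes of one square (n_residues * n_states) matrix.

    n_states and bytes_per_value are assumptions made for illustration.
    """
    dim = n_residues * n_states
    return dim * dim * bytes_per_value


one_matrix = covariance_matrix_bytes(300)
print(f"one matrix for 300 residues: {one_matrix / 1e9:.2f} GB")
print(f"one matrix for 600 residues: {covariance_matrix_bytes(600) / 1e9:.2f} GB")
```

Under these assumptions a single matrix for 300 residues is roughly 0.3 GB, so the figure of nearly 3 GB quoted above would correspond to several working arrays of that size. Doubling the sequence length quadruples the size of each matrix, while the number of sequences does not appear in the formula at all.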