Giter Site home page Giter Site logo

vecaligns's Introduction


Protein Embedding Based Alignments (PEbA)


Paper can be viewed at https://www.authorea.com/users/623259/articles/646069-protein-embedding-based-alignment.

This project uses Rostlab's ProtT5-XL-UniRef50 encoder (https://huggingface.co/Rostlab/prot_t5_xl_uniref50) to embed protein sequences and then calculates the cosine similarity between each amino acid's vector to align them using traditional sequence alignment algorithms. The cosine similarity scores are used in place of substitution matrices, such as BLOSUM, with the goal of producing accurate alignments for sequences that share low character identity (<20-35%), referred to as the "twilight zone" of protein sequence alignment.

PEbA is implemented in peba.py, where it can use either the Needleman-Wunsch 'global' algorithm or the Smith-Waterman 'local' algorithm to align two sequences. It is designed to be run from the command line. The following arguments are allowed, with -embed1 and -embed2 being optional arguments if the corresponding sequences from -file1 and -file2 have already been been embedded and saved as a numpy array:

-a,  --align <str>          : Alignment algorithm to use (global or local) (default: local)
-f1, --file1 <file1.fa>     : Path to first sequence fasta file
-f2, --file2 <file2.fa>     : Path to second sequence fasta file
-e1, --embed1 <file1.txt>   : Path to first sequence embedding file
-e2, --embed2 <file2.txt>   : Path to second sequence embedding file
-go, --gopen <int/float>    : Gap opening penalty (default: -11)
-ge, --gext <int/float>     : Gap extension penalty (default: -1)
-e, --encoder <str>         : Encoder used to generate embeddings (default: ProtT5)
-o, --output <str>          : Desired output format, either 'msf' or 'fa' (default: msf)
-s, --savefile <str>        : Path to save alignment file, prints to console if not specified

Embeddings from any model can be used as long as the embeddings are a 1D array the same length as the sequence. peba.py loads embeddings using numpy.loadtext(), so if you are pre-generating your own embeddings make sure they are compatible with said numpy function. The -encoder argument is used to specify the model used and written to output for later reference.


Installing Requirements


To install PEbA, clone this repository and install the necessary requirements.

git clone https://github.com/mgtools/PEbA.git
cd PEbA
pip install -r requirements.txt

This project was developed using python 3.10.6.


Running PEbA


An example of running PEbA with two fasta files and their corresponding embeddings is shown below:

python peba.py -f1 data/example/1j46_A.fa -f2 data/example/1k99_A.fa -e1 data/example/1j46_A.txt -e2 data/example/1k99_A.txt

On default settings, this will produce the following output in the console:

PileUp



   MSF: 93  Type: P  Method: ProtT5_Sim  Gopen: -11.0  Gext: -1.0

 Name: 1j46_A oo  Len:  93  Start/End:  0,85
 Name: 1k99_A oo  Len:  93  Start/End:  6,91

//



1j46_A      MQDRVKRPMN AFIVWSRDQR RKMALENPRM RNSEISKQLG YQWKMLTEAE 
1k99_A      HPDFPKKPLT PYFRFFMEKR AKYAKLHPEM SNLDLTKILS KKYKELPEKK 

1j46_A      KWPFFQEAQK LQAMHREKYP NYKYRPRRKA KMLPK
1k99_A      KMKYIQDFQR EKQEFERNLA RFREDHPDLI QNAKK

To embed sequences, PEbA will use the ProtT5-XL-UniRef50 model and T5encoder from HuggingFace. If -e1 and -e2 are specified, meaning that the embeddings are already generated, these models will not be used.


Comparing PEbA to Reference Alignments


To determine if PEbA alignments produce more accurate alignments than those with subsitution matrices, both types of alignments are compared to reference alignments from BAliBASE3 (https://www.lbgi.fr/balibase/), found in the data/BAliBASE_R1-5 folder. BAliBASE alignments are structure based benchmarks used to compare new methods of alignment, such as this one. References RV11 and 911 contain sequences with <20% sequence identity, references RV12 and RV912 with 20-40% sequence identity, and reference RV913 with 40-80% sequence identity.

The 'Sum-of-Pairs' (SP) score is used to compare the performance between the alignments generated by this project to the reference alignments. This percentage indicates how many residue pairs are shared between two alignments. compute_score.py computes this metric between two alignments, the first being the reference alignment, and the second being the alignment being compared. An example of the output is shown below:

SP: 0.551 ref_length: 690 comparison_length: 356 similarity: 0.146

SP score is a proportion between 0 and 1, 0 representing no shared residue pairs between two alignments, and 1 representing that all residue pairs are found in both alignments. The SP score is calculated by dividing the number of shared residue pairs between the reference and test alignment by the number of pairs in the reference alignment. ref_length is the length of the reference alignment, and comparison_length is the number of residue pairs being compared (gaps are not counted). The similarity is the percentage of identical residues between the two sequences that are matched together i.e. if a pair has the same two residues, it is a match.

F1 score can also be calculated to compare alignments, but SP score is the default.


Data and Methods Compared


From each BAliBASE benchmark MSA we extracted each sequence and each pairwise alignment. We took each pair of sequences and generated alignments using PEbA with ProtT5 embeddings, PEbA with ESM2 embeddings, BLOSUM, DEDAL, vcMSA, and FATCAT (only on RV11).


Language Models and Other Methods


The following guide was used for working with ProtT5: https://github.com/agemagician/ProtTrans/blob/master/Embedding/PyTorch/Advanced/ProtT5-XL-UniRef50.ipynb We adapted this code and used it to embed sequences in embed_seqs.py.

The readme for ESM2 can be found here, which contains instructions for using the model: https://github.com/facebookresearch/esm We adapted their code and used it to embed sequences in embed_seqs.py.

The readme for DEDAL can be found here, which contains instructions for using the model: https://github.com/google-research/google-research/tree/master/dedal We adapted their code and used it to run their model in get_aligns.py.

FATCAT github can be found here: https://github.com/GodzikLab/FATCAT-dist

vcMSA github can be found here: https://github.com/clairemcwhite/vcmsa


Results and Figures


The 'data/alignments' folder contains all of the pairwise alignments that we generated and used for analysis. The 'figures' folder contains several scripts used to generate the figures and tables in the paper.


License


Licensed under the Academic Free License version 3.0.

vecaligns's People

Contributors

biovino1 avatar

Stargazers

 avatar

Watchers

 avatar

Forkers

amelie-iska

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.