passionlab / pastis Goto Github PK
View Code? Open in Web Editor NEWDistributed Protein Sequence Aligner
License: Other
Distributed Protein Sequence Aligner
License: Other
Related to #2, the computation of ASAT seems logically incorrect.
Alignments should be vectorized but SeqAn only supports inter-sequence vectorization. The current code has a style that applies a pairwise function for each pair separately. This has to change to accommodate batch vectorization. Below is Saliya's email:
Yes, it's batch vectorization and global alignment (semi-global might work too assuming the vectorized API is compatible; I'll check in SeqAn). I recently got it to work with the latest SeqAn but it is not there in the dev branch yet. The reason is that batch vectorization required similar code changes as needed for the GPU integration, so I took this out of the dev branch to generalize the changes. This requires changing internal sequence representation to support both std::string (for GPU) and seqan format.
Is there a way to use transitivity of homology directly in the PISA/DISTAL? Or should we always leave that to the subsequent clustering step.
The substitute matrix, S
, shows a high load imbalance. Fixing this may require keeping a randomized mapping of k-mers to k-mer IDs.
See the email thread About Large Runs
on 12/29/2019
. Here's a logfile from a run that shows this effect and also fails.
Process Grid (p x p x t): 68 x 68 x 2
INFO: Program started on Sat Dec 28 20:04:12 2019
INFO: Job ID knl_fa_shuff_subs25/knl_fa_shuff_subs25_c61ed871-1547-4285-8082-f05137282334
Parameters...
Input file (-i): /global/cscratch1/sd/esaliya/data/isolates/archaea/sanitized_2728834_impure_2729008_len_lte_2000_in_shuffled_isolates_proteins_archaea.fasta
Original sequence count (-c): 2728834
Kmer length (k): 6
Kmer stride (s): 1
Overlap in bytes (-O): 10000
Max seed count (--sc): 1
Gap open penalty (-g): -11
Gap extension penalty (-e): -2
Overlap file (--of): None
Alignment file (--af): knl_fa_shuff_subs25/knl_fa_shuff_subs25_align.txt
Alignment write frequency (--afreq): 100000
No align (--na): False
Full align (--fa): True
Xdrop align (--xa): False
Banded align (--ba): False
Index map (--idxmap): knl_fa_shuff_subs25_archaea_idx_map.txt
Alphabet (--alph): 0
Use substitute kmers (--subs): True | sub kmers: 25
Creating fileknl_fa_shuff_subs25_archaea_idx_map.txt with 41438932 bytes
File knl_fa_shuff_subs25_archaea_idx_map.txt is actually 41438932 bytes seen from process 4623
INFO: Modfied sequence count
Final sequence count: 2728822 (0.000440% removed)
Matrix A:
Load imbalance: 3.118424
As a whole: 2728822 rows and 244140625 columns and 718716196 nonzeros
Matrix At: As a whole: 244140625 rows and 2728822 columns and 718716196 nonzeros
Matrix S:
Load imbalance: 113.142021
As a whole: 244140625 rows and 244140625 columns and 723834658 nonzeros
Matrix AS:
Load imbalance: 2.567925
As a whole: 2728822 rows and 244140625 columns and 10751320837 nonzeros
terminate called after throwing an instance of 'std::bad_alloc'
The code is targetted for protein sequence alignment as of today. We need to generalize this to add support for DNA sequences.
There are a couple of things that we may need to think in the future due to the S matrix being asymmetric. This is the first issue to think about.
We lose pairs in the resultant matrix because we only compute the top half due to this asymmetry in the resultant ASAt.
Each process in the process grid computes their upper halves only. For example Pij and Pji collectively computes all non-zero overlaps of process grid cell ij (or ji).
When the matrix is symmetric this works flawlessly because the top half of Pij captures all the non-zero overlaps of the bottom half of Pji, which is not computed by Pji. In the same way, the top half of Pji captures all the non-zero elements of the bottom half of Pij.
However, when the matrix is asymmetric the above condition does not hold true anymore. To fix it we can do the following.
Each process, Pij, can send the non-zero pair IDs of their bottom halves to Pji using MPI_Isend asynchornously.
While they receive these bottom half pair IDs from their partner processes, they can compute alignments for the non-zero elements in their top-halves.
Once, the alignments are computed and done for the partner pair IDs, each process can check which of the received pair IDs were NOT in their top-halves. Each process should perform alignments for those pair IDs that were not previously computed from this received set.
During the generation of sequence to k-mers matrix, A, each process keeps track of unique k-mers local to it. In order to generate S from these, processes have to communicate these to figure out the global set of unique k-mers.
The way this is done in the code is as follows.
MPI_Allreduction
using MPI_LOR
.Currently, the code relies on a single MPI_Allreduce
, which limits the size of the boolean array to be less than 231. This works for k=6
with proteins but will fail for anything above that for proteins as the alphabet size is 25.
The solution to this would be to use multiple MPI_Allreduce
calls over parts of the boolean array.
Currently, the Blosum62 scoring matrix is used internally for proteins. We need to generalize this support to take in any user given scoring matrix.
Can we utilize the multiple sequence alignment (MSA) or HMM information directly in PISA/DISTAL? Let's say we have 80% ANI between two proteins. Clearly they are homologous (?), then we can build an HMM on the fly for that family and use that to increase sensitivity of subsequent queries?
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.