passionlab / pastis Goto Github PK

View Code? Open in Web Editor NEW

6.0 6.0 0.0 12.04 MB

Distributed Protein Sequence Aligner

License: Other

CMake 0.76% C++ 99.24%

pastis's People

Contributors

Stargazers

Watchers

pastis's Issues

ASAt or AStAt?

Related to #2, the computation of ASA^T seems logically incorrect.

S is a mapping from each k-mer to its nearest substitute k-mers.
S, however, is not symmetric.
In this regard, while a row in S makes sense, a column in S does not necessarily mean the same thing. Therefore during AS, the multiplication of a row in A with a column of S is not what we would really need, rather it seems we should perform AS^T.

Alignments should be vectorized but SeqAn only supports inter-sequence vectorization. The current code has a style that applies a pairwise function for each pair separately. This has to change to accommodate batch vectorization. Below is Saliya's email:

Yes, it's batch vectorization and global alignment (semi-global might work too assuming the vectorized API is compatible; I'll check in SeqAn). I recently got it to work with the latest SeqAn but it is not there in the dev branch yet. The reason is that batch vectorization required similar code changes as needed for the GPU integration, so I took this out of the dev branch to generalize the changes. This requires changing internal sequence representation to support both std::string (for GPU) and seqan format.

Use of transitivity

Is there a way to use transitivity of homology directly in the PISA/DISTAL? Or should we always leave that to the subsequent clustering step.

Load Imbalance in Substitute Matrix

The substitute matrix, S, shows a high load imbalance. Fixing this may require keeping a randomized mapping of k-mers to k-mer IDs.

See the email thread About Large Runs on 12/29/2019. Here's a logfile from a run that shows this effect and also fails.

Process Grid (p x p x t): 68 x 68 x 2

INFO: Program started on Sat Dec 28 20:04:12 2019

INFO: Job ID knl_fa_shuff_subs25/knl_fa_shuff_subs25_c61ed871-1547-4285-8082-f05137282334
Parameters...
  Input file (-i):                     /global/cscratch1/sd/esaliya/data/isolates/archaea/sanitized_2728834_impure_2729008_len_lte_2000_in_shuffled_isolates_proteins_archaea.fasta
  Original sequence count (-c):        2728834
  Kmer length (k):                     6
  Kmer stride (s):                     1
  Overlap in bytes (-O):               10000
  Max seed count (--sc):               1
  Gap open penalty (-g):               -11
  Gap extension penalty (-e):          -2
  Overlap file (--of):                 None
  Alignment file (--af):               knl_fa_shuff_subs25/knl_fa_shuff_subs25_align.txt
  Alignment write frequency (--afreq): 100000
  No align (--na):                     False
  Full align (--fa):                   True
  Xdrop align (--xa):                  False
  Banded align (--ba):                 False
  Index map (--idxmap):                knl_fa_shuff_subs25_archaea_idx_map.txt
  Alphabet (--alph):                   0
  Use substitute kmers (--subs):       True | sub kmers: 25
Creating fileknl_fa_shuff_subs25_archaea_idx_map.txt with 41438932 bytes
File knl_fa_shuff_subs25_archaea_idx_map.txt is actually 41438932 bytes seen from process 4623

INFO: Modfied sequence count
  Final sequence count: 2728822 (0.000440% removed)
Matrix A: 
Load imbalance: 3.118424
As a whole: 2728822 rows and 244140625 columns and 718716196 nonzeros
Matrix At: As a whole: 244140625 rows and 2728822 columns and 718716196 nonzeros
Matrix S: 
Load imbalance: 113.142021
As a whole: 244140625 rows and 244140625 columns and 723834658 nonzeros
Matrix AS: 
Load imbalance: 2.567925
As a whole: 2728822 rows and 244140625 columns and 10751320837 nonzeros
terminate called after throwing an instance of 'std::bad_alloc'

Support for DNA Sequence Alignment

The code is targetted for protein sequence alignment as of today. We need to generalize this to add support for DNA sequences.

Losing Alignments Due to Asymmetric Substitution Matrix

There are a couple of things that we may need to think in the future due to the S matrix being asymmetric. This is the first issue to think about.

We lose pairs in the resultant matrix because we only compute the top half due to this asymmetry in the resultant ASAt.

Each process in the process grid computes their upper halves only. For example P_ij and P_ji collectively computes all non-zero overlaps of process grid cell ij (or ji).
When the matrix is symmetric this works flawlessly because the top half of P_ij captures all the non-zero overlaps of the bottom half of P_ji, which is not computed by P_ji. In the same way, the top half of P_ji captures all the non-zero elements of the bottom half of P_ij.
However, when the matrix is asymmetric the above condition does not hold true anymore. To fix it we can do the following.
- Each process, P_ij, can send the non-zero pair IDs of their bottom halves to P_ji using MPI_Isend asynchornously.
- While they receive these bottom half pair IDs from their partner processes, they can compute alignments for the non-zero elements in their top-halves.
- Once, the alignments are computed and done for the partner pair IDs, each process can check which of the received pair IDs were NOT in their top-halves. Each process should perform alignments for those pair IDs that were not previously computed from this received set.

Need MPI Chunking to Support Large K Values

During the generation of sequence to k-mers matrix, A, each process keeps track of unique k-mers local to it. In order to generate S from these, processes have to communicate these to figure out the global set of unique k-mers.

The way this is done in the code is as follows.

Every process creates a boolean array of the size |Alph|^k, where |Alph| is the size of the alphabet. For proteins, it's 25^k.
This array serves as the process-local unique k-mer ID list.
Once, each process has found its list of k-mers, it participates in an MPI_Allreduction using MPI_LOR.
This results in a globally unique k-mer list.

Currently, the code relies on a single MPI_Allreduce, which limits the size of the boolean array to be less than 2³¹. This works for k=6 with proteins but will fail for anything above that for proteins as the alphabet size is 25.

The solution to this would be to use multiple MPI_Allreduce calls over parts of the boolean array.

Support for Different Scoring Matrices

Currently, the Blosum62 scoring matrix is used internally for proteins. We need to generalize this support to take in any user given scoring matrix.

Profile HMM or MSA use

Can we utilize the multiple sequence alignment (MSA) or HMM information directly in PISA/DISTAL? Let's say we have 80% ANI between two proteins. Clearly they are homologous (?), then we can build an HMM on the fly for that family and use that to increase sensitivity of subsequent queries?

passionlab / pastis Goto Github PK

pastis's People

Contributors

Stargazers

Watchers

pastis's Issues

ASAt or AStAt?

Alignment vectorization

Use of transitivity

Load Imbalance in Substitute Matrix

Support for DNA Sequence Alignment

Losing Alignments Due to Asymmetric Substitution Matrix

Need MPI Chunking to Support Large K Values

Support for Different Scoring Matrices

Profile HMM or MSA use

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent