Giter Site home page Giter Site logo

pastis's People

Contributors

aydinbuluc avatar esaliya avatar giuliaguidi avatar roguzsel avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar

pastis's Issues

ASAt or AStAt?

Related to #2, the computation of ASAT seems logically incorrect.

  • S is a mapping from each k-mer to its nearest substitute k-mers.
  • S, however, is not symmetric.
  • In this regard, while a row in S makes sense, a column in S does not necessarily mean the same thing. Therefore during AS, the multiplication of a row in A with a column of S is not what we would really need, rather it seems we should perform AST.

Alignment vectorization

Alignments should be vectorized but SeqAn only supports inter-sequence vectorization. The current code has a style that applies a pairwise function for each pair separately. This has to change to accommodate batch vectorization. Below is Saliya's email:

Yes, it's batch vectorization and global alignment (semi-global might work too assuming the vectorized API is compatible; I'll check in SeqAn). I recently got it to work with the latest SeqAn but it is not there in the dev branch yet. The reason is that batch vectorization required similar code changes as needed for the GPU integration, so I took this out of the dev branch to generalize the changes. This requires changing internal sequence representation to support both std::string (for GPU) and seqan format.

Use of transitivity

Is there a way to use transitivity of homology directly in the PISA/DISTAL? Or should we always leave that to the subsequent clustering step.

Load Imbalance in Substitute Matrix

The substitute matrix, S, shows a high load imbalance. Fixing this may require keeping a randomized mapping of k-mers to k-mer IDs.

See the email thread About Large Runs on 12/29/2019. Here's a logfile from a run that shows this effect and also fails.

Process Grid (p x p x t): 68 x 68 x 2

INFO: Program started on Sat Dec 28 20:04:12 2019

INFO: Job ID knl_fa_shuff_subs25/knl_fa_shuff_subs25_c61ed871-1547-4285-8082-f05137282334
Parameters...
  Input file (-i):                     /global/cscratch1/sd/esaliya/data/isolates/archaea/sanitized_2728834_impure_2729008_len_lte_2000_in_shuffled_isolates_proteins_archaea.fasta
  Original sequence count (-c):        2728834
  Kmer length (k):                     6
  Kmer stride (s):                     1
  Overlap in bytes (-O):               10000
  Max seed count (--sc):               1
  Gap open penalty (-g):               -11
  Gap extension penalty (-e):          -2
  Overlap file (--of):                 None
  Alignment file (--af):               knl_fa_shuff_subs25/knl_fa_shuff_subs25_align.txt
  Alignment write frequency (--afreq): 100000
  No align (--na):                     False
  Full align (--fa):                   True
  Xdrop align (--xa):                  False
  Banded align (--ba):                 False
  Index map (--idxmap):                knl_fa_shuff_subs25_archaea_idx_map.txt
  Alphabet (--alph):                   0
  Use substitute kmers (--subs):       True | sub kmers: 25
Creating fileknl_fa_shuff_subs25_archaea_idx_map.txt with 41438932 bytes
File knl_fa_shuff_subs25_archaea_idx_map.txt is actually 41438932 bytes seen from process 4623

INFO: Modfied sequence count
  Final sequence count: 2728822 (0.000440% removed)
Matrix A: 
Load imbalance: 3.118424
As a whole: 2728822 rows and 244140625 columns and 718716196 nonzeros
Matrix At: As a whole: 244140625 rows and 2728822 columns and 718716196 nonzeros
Matrix S: 
Load imbalance: 113.142021
As a whole: 244140625 rows and 244140625 columns and 723834658 nonzeros
Matrix AS: 
Load imbalance: 2.567925
As a whole: 2728822 rows and 244140625 columns and 10751320837 nonzeros
terminate called after throwing an instance of 'std::bad_alloc'

Losing Alignments Due to Asymmetric Substitution Matrix

There are a couple of things that we may need to think in the future due to the S matrix being asymmetric. This is the first issue to think about.

We lose pairs in the resultant matrix because we only compute the top half due to this asymmetry in the resultant ASAt.

  • Each process in the process grid computes their upper halves only. For example Pij and Pji collectively computes all non-zero overlaps of process grid cell ij (or ji).

  • When the matrix is symmetric this works flawlessly because the top half of Pij captures all the non-zero overlaps of the bottom half of Pji, which is not computed by Pji. In the same way, the top half of Pji captures all the non-zero elements of the bottom half of Pij.

  • However, when the matrix is asymmetric the above condition does not hold true anymore. To fix it we can do the following.

    • Each process, Pij, can send the non-zero pair IDs of their bottom halves to Pji using MPI_Isend asynchornously.

    • While they receive these bottom half pair IDs from their partner processes, they can compute alignments for the non-zero elements in their top-halves.

    • Once, the alignments are computed and done for the partner pair IDs, each process can check which of the received pair IDs were NOT in their top-halves. Each process should perform alignments for those pair IDs that were not previously computed from this received set.

Need MPI Chunking to Support Large K Values

During the generation of sequence to k-mers matrix, A, each process keeps track of unique k-mers local to it. In order to generate S from these, processes have to communicate these to figure out the global set of unique k-mers.

The way this is done in the code is as follows.

  1. Every process creates a boolean array of the size |Alph|k, where |Alph| is the size of the alphabet. For proteins, it's 25k.
  2. This array serves as the process-local unique k-mer ID list.
  3. Once, each process has found its list of k-mers, it participates in an MPI_Allreduction using MPI_LOR.
  4. This results in a globally unique k-mer list.

Currently, the code relies on a single MPI_Allreduce, which limits the size of the boolean array to be less than 231. This works for k=6 with proteins but will fail for anything above that for proteins as the alphabet size is 25.

The solution to this would be to use multiple MPI_Allreduce calls over parts of the boolean array.

Profile HMM or MSA use

Can we utilize the multiple sequence alignment (MSA) or HMM information directly in PISA/DISTAL? Let's say we have 80% ANI between two proteins. Clearly they are homologous (?), then we can build an HMM on the fly for that family and use that to increase sensitivity of subsequent queries?

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.