Hi <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="

Redundant k-mer in contigs about twopaco HOT 8 OPEN

medvedevgroup commented on June 15, 2024

Redundant k-mer in contigs

from twopaco.

Comments (8)

iminkin commented on June 15, 2024 1

However, my understanding now is that, instead, k-mers are canonicalized, so that if the reverse complement of a k-mer appears somewhere else on the reference in the same orientation, it will be associated with the same node, but that the forward and reverse complement of the reference are processed separately and merged after the fact. Is this correct?

Not quite. A k-mer is glued with a k-mer if they are both spelled exactly the same way. For example, a k-mer "ACT" will be glued with another "ACT" (even if it appears on a different strand), but not with "AGT". A note: I think that there is a typo in the paper. If the length of a string constituing a vertex is of size $k$ and edges have sizes $k + 1$, then we "canonicalize" (k + 1)-mers, but it is a mere performance consideration and does not affect the way TwoPaCo constructs the graph.

So here is an example. Consider k = 3 and S+ = ACTGTTTCAGT. The de Bruijn graph looks like this:

The reverse complement of S+ is, S- = ACTGAAACAGT, and its de Bruijn graph is like this:

After we merge the graphs they look like this:

This is the graph that TwoPaCo compacts and outputs. Since GFA is "node-centric", segments will correspond to sequences spelled by edges. There will be two segments, "ACGT" and "CTGAAACAG". Let the first segment be +1, and teh second be +2. The whole input sequence is spelled by a path +1 + 2 -1. That is sort of cumbersome, because we have to go back and forth between node- an edge-centric representations here.

from twopaco.

rob-p commented on June 15, 2024 1

Hi @IlyaMinkin ,

Thanks for the detailed explanation and reference. I think I get it now. So the reason that the "palindromic" sequences have duplicate (k+1)-mers (assuming canonicalization) is because their in/out edges degrees are only counted once --- that is, if I encounter that k-mer in a plaindromic sequence, and I look at the two (k+1)-mers extending it, they are the same after canonicalization.

For example, take the sequence S = AACTGACA|TGTCAGTT, and let k=3. The | denotes where the sequence reflects over itself to form the palindrome. Clearly, there are repeated canonical (k+1)-mers in this sequence, since :

AACT = AGTT,
ACTG = CAGT,
CTGA = TCAG,
TGAC = GTCA,
GACA = TGTC

However, as I scan through the k-mers as the TwoPaCo algorithm does, each time I check for the (k+1)-mers that contain a k-mer, and canonicalize them, I end up with only a single in and out edge for these k-mers (I think this is the case for all of the, I only spot-checked a few). Thus, we won't find junction nodes and the entire sequence will be considered as a contig. Pictorially, it looks like:

 [AAC] -> [ACT] -> [TGA] -> [GAC] -> [ACA] -> [CAT] ->
                                                      \  
                                                       |
                                                      /
 [GTT]<- [AGT] <- [TCA] <- [GTC] <- [TGT] <- [ATG] <-

because, were we to reflect the first half of this sequence about the |, we would get the second half --- i.e., there is a self-edge at the [CAT | ATG] node. Thus, I was originally expecting we'd get a contig +1 = AACTGACA and we would spell out S as +1,-1. Instead, since we are "junction-free", we simply say +1 = AACTGACATGTCAGTT and spell out S = +1, despite the fact that this contig contains duplicate (k+1)-mers if we consider canonicalization. Is this interpretation correct? If so, it seems that duplicate k-mers and k+1-mers in palindromic contigs are an inherent aspect of the junction-based approach adopted by TwoPaCo. Fortunately, it also seems that these are the only such cases where duplicate k+1-mers can occur.

from twopaco.

iminkin commented on June 15, 2024

Hi @rob-p,

I will take a closer look, but for now: what is wrong with having the same k-mer at the start and end of the segment? There could be a self-loop in a de Bruijn graph. For example, if there is a path A -> B -> C -> A, after compaction it becomes A -> BC ->A and the edge spells ABCA. If the second copy of A gets truncated there will be no-self loop anymore.

from twopaco.

rob-p commented on June 15, 2024

Hi @IlyaMinkin,

I agree that the case you state is possible, and OK, if we're writing down the GFA such that segment overlaps are of size k. The concern above that the segment contains multiple repeated kmers --- not just the ones on the end of the contig. For example, take contig 214284 with sequence TGTATGTGTGTATATATGTGTGTATATATATATACACACATATATACACACATACA.

Here, if we consider the actual list of k-mers, we get (in Python notation):

['ATATATATACACACATATATACACACATACA',
 'GTATGTGTGTATATATGTGTGTATATATATA',
 'ATATATATATACACACATATATACACACATA',
 'ATGTGTGTATATATGTGTGTATATATATATA',
 'GTATATATATATACACACATATATACACACA',
 'GTGTGTATATATGTGTGTATATATATATACA',
 'GTGTATATATATATACACACATATATACACA',
 'GTGTATATATGTGTGTATATATATATACACA',
 'GTGTGTATATATATATACACACATATATACA',
 'GTATATATGTGTGTATATATATATACACACA',
 'ATGTGTGTATATATATATACACACATATATA',
 'ATATATGTGTGTATATATATATACACACATA',
 'ATATGTGTGTATATATATATACACACATATA',
 'ATATGTGTGTATATATATATACACACATATA',
 'ATATATGTGTGTATATATATATACACACATA',
 'ATGTGTGTATATATATATACACACATATATA',
 'GTATATATGTGTGTATATATATATACACACA',
 'GTGTGTATATATATATACACACATATATACA',
 'GTGTATATATGTGTGTATATATATATACACA',
 'GTGTATATATATATACACACATATATACACA',
 'GTGTGTATATATGTGTGTATATATATATACA',
 'GTATATATATATACACACATATATACACACA',
 'ATGTGTGTATATATGTGTGTATATATATATA',
 'ATATATATATACACACATATATACACACATA',
 'GTATGTGTGTATATATGTGTGTATATATATA',
 'ATATATATACACACATATATACACACATACA']

However, since we consider each k-mer equivalent to its reverse complement in the dBG, we can label each k-mer such that it receives a new label if it or its reverse complement has not yet been seen and otherwise it receives the same label that it's representative has already been assigned. In this case we get:

[(0, 'TGTATGTGTGTATATATGTGTGTATATATAT'),
 (1, 'GTATGTGTGTATATATGTGTGTATATATATA'),
 (2, 'TATGTGTGTATATATGTGTGTATATATATAT'),
 (3, 'ATGTGTGTATATATGTGTGTATATATATATA'),
 (4, 'TGTGTGTATATATGTGTGTATATATATATAC'),
 (5, 'GTGTGTATATATGTGTGTATATATATATACA'),
 (6, 'TGTGTATATATGTGTGTATATATATATACAC'),
 (7, 'GTGTATATATGTGTGTATATATATATACACA'),
 (8, 'TGTATATATGTGTGTATATATATATACACAC'),
 (9, 'GTATATATGTGTGTATATATATATACACACA'),
 (10, 'TATATATGTGTGTATATATATATACACACAT'),
 (11, 'ATATATGTGTGTATATATATATACACACATA'),
 (12, 'TATATGTGTGTATATATATATACACACATAT'),
 (12, 'ATATGTGTGTATATATATATACACACATATA'),
 (11, 'TATGTGTGTATATATATATACACACATATAT'),
 (10, 'ATGTGTGTATATATATATACACACATATATA'),
 (9, 'TGTGTGTATATATATATACACACATATATAC'),
 (8, 'GTGTGTATATATATATACACACATATATACA'),
 (7, 'TGTGTATATATATATACACACATATATACAC'),
 (6, 'GTGTATATATATATACACACATATATACACA'),
 (5, 'TGTATATATATATACACACATATATACACAC'),
 (4, 'GTATATATATATACACACATATATACACACA'),
 (3, 'TATATATATATACACACATATATACACACAT'),
 (2, 'ATATATATATACACACATATATACACACATA'),
 (1, 'TATATATATACACACATATATACACACATAC'),
 (0, 'ATATATATACACACATATATACACACATACA')]

where the first element of the tuple is the label and the second is the k-mer. So, the issue is that the original path looks like A -> B -> C -> ... M -> M -> L -> .... -> C -> B -> A, or, if we take the "forward" orientation of k-mer to be the orientation with which we first see it, this can be read off as A -> B -> C -> ... M -> rc(M) -> rc(L) -> .... -> rc(C) -> rc(B) -> rc(A). So, the problem is not just that the k-mer A appears at the beginning and end of the path, which seems unavoidable in such a situation where we are writing down overlaps of length k between segments, but rather that every k-mer in these contigs appears twice; once in the forward orientation and once in the reverse complement orientation.

Please let me know if my explanation makes sense, or if I'm misunderstanding something about the representation.

Thanks,
Rob

from twopaco.

iminkin commented on June 15, 2024

Hi @rob-p,

Sorry for a late reply. The issue here is that we assume different models for handling double strandness of the DNA. You assume that:

However, since we consider each k-mer equivalent to its reverse complement in the dBG, we can label each k-mer such that it receives a new label if it or its reverse complement has not yet been seen and otherwise it receives the same label that it's representative has already been assigned.

But in TwoPaCo it is not the case. What we do (almost) is build the de Bruijn graph for the original string S+, call it G(S+, k). Then we build the graph from the reverse complement of S+, call it S- and the graph G(S-, k). Then we merge G(S+, k) and G(S-, k) to obtain the resulting graph. In this graph a k-mer and its reverse complement are technically different k-mers as they have different sequences. In case if both direct and reverse copies of a substring are present in the input it is indicated by edges traversing through those k-mers. GFA contains sequences spelled by the edges of the graph. We slightly discuss this topic in the paper, but not much.

Unfortunately, we did not foresee the potential issues coming from different models of the graph :(

from twopaco.

rob-p commented on June 15, 2024

Hi @IlyaMinkin,

Thanks for the explanation. I think I was confused by the wording in the paper:

For a string s, let \hat{s} be its reverse
complement, and define the comprehensive de Bruijn Graph as the
graph G_comp(s,k) = G(s,k) \cup G(\hat{s},k) the graph for multiple strings
and the compacted graph is defined analogously. To build the compacted
comprehensive graph, we have to modify Algorithm 1 so that
E represents each k-mer and its reverse complement jointly. For example,
this can be done by always storing the canonical form of a
k-mer, which is the lexicographically smallest string between the
k-mer and its reverse complement (Chikhi et al., 2014). Similarly,
we have to be careful when we make membership queries to E in
Algorithm 1, so that we are always querying canonical k-mers.

Specifically, the phrase we have to modify Algorithm 1 so that E represents each k-mer and its reverse complement jointly led me to think that TwoPaCo canonicalizes the k-mers in a way that a k-mer and its reverse complement in a reference string are always associated with the same node. However, my understanding now is that, instead, k-mers are canonicalized, so that if the reverse complement of a k-mer appears somewhere else on the reference in the same orientation, it will be associated with the same node, but that the forward and reverse complement of the reference are processed separately and merged after the fact. Is this correct?

If so, I wonder if there are other ways that duplicated k-mers might sneak in other than at the edges of contigs or in "palindromic" sequences. To handle the k vs. k+1 and palindromic contig "issues", we've already written a GFA converter that transforms the TwoPaCo GFA into one where nodes of length at least k always overlap by exactly k-1 nucleotides. We also break palindromic contigs into paths of k-mers during processing to ensure they are re-merged such that their constituent k-mers appear only once in the output. This strategy seems to work for the human transcriptome and the human genome (i.e., we can produce a converted GFA that satisfies all other requirements --- every reference sequence is properly spelled out, all contigs overlap by k-1 nucleotides, and every k-mer appears exactly once). However, these examples don't constitute a correctness proof for the conversion procedure ;P.

from twopaco.

iminkin commented on June 15, 2024

Note that though "ACTG" and "CAGT" are reverse complement of each other, they yield different nodes and edges. But becuse "ACTG" is also present on the reverse strand, it will go through the same vertices as "ACTG" on the direct strand. It yields a parallel path with two edges, a blue one and a red one, showing occurrences of the same substring in both direct and reverse copies.

There is some discussion of this model in the paper: https://link.springer.com/chapter/10.1007/978-3-540-74126-8_27, though it was introduced a long time ago.

from twopaco.

iminkin commented on June 15, 2024

Hi @rob-p,

Sorry for a late reply. I think you are right.

from twopaco.

Redundant k-mer in contigs about twopaco HOT 8 OPEN

Comments (8)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent