What are the implications of the graph encoding (i.e. implying the existence of) seque


graph can imply unobserved sequences about vg HOT 2 CLOSED

vgteam commented on July 19, 2024

graph can imply unobserved sequences

from vg.

Comments (2)

ekg commented on July 19, 2024

@davebiffuk is referring to the phenomenon that occurs when phasing information about the original sequences is removed and the graph is constructed using only the edit information implied by the VCF, ignoring any haplotypes that are given as input. (As of the time of this comment, this is the default mode of VCF-based construction in vg.)

This is a really interesting issue. It presents some problems but may not be completely avoidable in all contexts. I'll try to explain when it does and doesn't make sense to allow the graph to imply unobserved sequences.

Suppose this graph had been generated from two original input sequences, one red and one black:

This graph contains many sequences that may never have been observed, for instance the path [1,2,5,6,7]. In this example there are exactly 48 paths. You can verify this by running the following in the test directory of this repo:

```shell
# first we make a sub-graph by taking the head of the GFA format version of the graph
vg construct -r small/x.fa -v small/x.vcf.gz \
    | vg view - | head -28 | vg view -v - >y.vg
# then we make all paths of ~length 40, which includes all the paths in this graph
vg paths -s -l 40 y.vg | wc -l # 48
```
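To see why the count grows so quickly, here is a minimal Python sketch that counts source-to-sink walks in a toy graph. The node ids and edges are hypothetical (this is not the graph built from small/x.vcf.gz); the point is only that each independent variant site multiplies the number of implied paths.

```python
from functools import lru_cache

# Hypothetical toy variation graph: three biallelic sites in a row.
# Each key is a node id; each value lists the nodes reachable by one edge.
EDGES = {
    1: [2, 3],    # site 1: two alleles
    2: [4], 3: [4],
    4: [5, 6],    # site 2: two alleles
    5: [7], 6: [7],
    7: [8, 9],    # site 3: two alleles
    8: [10], 9: [10],
    10: [],       # sink
}

def count_paths(edges, source, sink):
    """Count distinct source-to-sink walks in a DAG by memoized recursion."""
    @lru_cache(maxsize=None)
    def walks_from(node):
        if node == sink:
            return 1
        return sum(walks_from(nxt) for nxt in edges[node])
    return walks_from(source)

print(count_paths(EDGES, 1, 10))  # 2 * 2 * 2 = 8
```

With only two input sequences, six of those eight walks would be unobserved recombinants.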

This seems problematic, but there are a few things to keep in mind.

1. Homologous regions can support recombination. Although recombination is rare (on the order of de novo variation), it does happen and it can happen anywhere. Many species do not have specific recombination loci. These paths could exist in the event of a specific set of recombinations between the original red and black sequences.
2. This succinct representation is large in terms of the haplotype space but small in terms of sequence space. This enables sensitive and efficient pairwise local alignment algorithms to run natively on the variation graph.
3. Allowing a graph to encode sequences which haven't been observed could be expedient. For example, you may not know the actual genomes that were observed, and only have information about variants and frequencies. This is actually a rather common situation, particularly when the identities of the individuals who contributed to the list of variants are private or not shareable. A variant list is easy to exchange and rather lightweight relative to a full set of haplotypes.

Problems do occur. In particular, if one naïvely samples _k_mers of a given length from a graph that allows many recombinations between closely-spaced variants, certain regions will generate huge numbers of _k_mers. This limits our ability to map to those regions and, in the extreme, even our ability to generate the _k_mer index of the graph (done via vg index -k N x.vg).
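A rough illustration of the blow-up, as a Python sketch. The graph builder, node names, and sequences here are all made up for the example, and the counting is a naive enumeration rather than what vg does internally. It counts every k-mer occurrence (one per start position and walk) in a chain of adjacent SNPs and compares that with the same region as a linear reference.

```python
def snp_chain(n_sites, with_alts=True, flank="ACGT"):
    """Toy graph: flank node, n_sites adjacent single-base sites, flank node."""
    seqs = {"L": flank, "R": flank}
    edges = {}
    prev = ["L"]
    for i in range(n_sites):
        alleles = [f"s{i}_ref"] + ([f"s{i}_alt"] if with_alts else [])
        seqs[alleles[0]] = "A"
        if with_alts:
            seqs[alleles[1]] = "G"
        for node in prev:
            edges[node] = alleles
        prev = alleles
    for node in prev:
        edges[node] = ["R"]
    return seqs, edges

def count_kmer_walks(seqs, edges, k):
    """Count k-mer occurrences: one per (start node, offset, walk) of length k."""
    def extend(node, offset, need):
        avail = len(seqs[node]) - offset
        if avail >= need:
            return 1  # the k-mer ends inside this node
        # Otherwise continue into every successor for the remaining bases.
        return sum(extend(nxt, 0, need - avail) for nxt in edges.get(node, []))
    return sum(extend(node, off, k)
               for node, seq in seqs.items()
               for off in range(len(seq)))

graph = snp_chain(3)                    # three adjacent SNPs
linear = snp_chain(3, with_alts=False)  # the same region, reference only
print(count_kmer_walks(*graph, k=4))    # 30
print(count_kmer_walks(*linear, k=4))   # 8
```

Every extra site inside the k-mer window doubles the number of walks through that window, so dense clusters of variants dominate the total.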

This issue can be mitigated in several ways.

1. We can limit the number of edges that may be crossed when a _k_mer is generated. (To do this, specify -e to vg paths, vg kmers, or vg index, as in: vg paths -s -k 21 -e 9 x.vg.)
2. We can construct using a VCF which has short haplotypes combined into a single variant (so, multiallelic variants with long lengths against the reference) and use this for graph construction. Note that this isn't yet supported, but would require only a few minor changes to begin testing.
3. We could remove any edge that doesn't lie in a path defined by one of the input haplotypes. This also isn't supported but would be straightforward to do, and is probably better than (2).
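The first option can be sketched in a few lines of Python. This is only an illustration of the idea under my own assumptions: the toy graph is hypothetical, and the sketch counts every edge crossing against the limit, which may not match exactly how vg counts edges for -e.

```python
# Hypothetical toy graph: a flank, three adjacent single-base SNPs, a flank.
SEQS = {"L": "ACGT", "a0": "A", "b0": "G", "a1": "A", "b1": "G",
        "a2": "A", "b2": "G", "R": "ACGT"}
EDGES = {"L": ["a0", "b0"],
         "a0": ["a1", "b1"], "b0": ["a1", "b1"],
         "a1": ["a2", "b2"], "b1": ["a2", "b2"],
         "a2": ["R"], "b2": ["R"]}

def count_kmer_walks(seqs, edges, k, max_edges=None):
    """Count k-mer occurrences, dropping walks that cross too many edges."""
    def extend(node, offset, need, crossed):
        avail = len(seqs[node]) - offset
        if avail >= need:
            return 1  # the k-mer ends inside this node
        if max_edges is not None and crossed >= max_edges:
            return 0  # edge budget exhausted: abandon this walk
        return sum(extend(nxt, 0, need - avail, crossed + 1)
                   for nxt in edges.get(node, []))
    return sum(extend(node, off, k, 0)
               for node, seq in seqs.items()
               for off in range(len(seq)))

print(count_kmer_walks(SEQS, EDGES, k=4))               # 30
print(count_kmer_walks(SEQS, EDGES, k=4, max_edges=2))  # 14
```

Note the trade-off this makes visible: in the toy graph the cap also discards the reference k-mers that span all three sites, so a limit that is too tight costs sensitivity in exactly the dense regions.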

I haven't pursued the latter two approaches because I think it should be possible for people to use vg to build graphs when they only have variant lists. This is sometimes harder than building and mapping against graphs that only contain observed haplotypes as paths. Experience with real data will likely suggest the best and most general approach.


ekg commented on July 19, 2024

I'm going to mark this as closed. The way to resolve this in the future is to store the paths of haplotypes, perhaps as compressed bitvectors, or perhaps in pBWT, perhaps both.
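As a rough sketch of the bitvector idea in Python: plain integers stand in for the bitvectors here (nothing is compressed, and this is not a pBWT), and the two haplotype walks are hypothetical rather than the red and black sequences discussed above. Each edge stores one bit per haplotype, and a walk is supported only if some haplotype uses every one of its edges.

```python
def edge_masks(haplotypes):
    """Map each edge (u, v) to an int bitmask; bit i is set if haplotype i uses it."""
    masks = {}
    for i, walk in enumerate(haplotypes):
        for edge in zip(walk, walk[1:]):
            masks[edge] = masks.get(edge, 0) | (1 << i)
    return masks

def supporting_haplotypes(masks, walk):
    """Bitmask of haplotypes that traverse every edge of `walk` (two or more nodes)."""
    support = -1  # all bits set; narrowed by each edge below
    for edge in zip(walk, walk[1:]):
        support &= masks.get(edge, 0)
    return support

# Hypothetical input haplotypes as node walks (bit 0 = red, bit 1 = black).
red = [1, 2, 4, 5, 7]
black = [1, 3, 4, 6, 7]
masks = edge_masks([red, black])

print(supporting_haplotypes(masks, [1, 2, 4, 5, 7]))  # 1: red only
print(supporting_haplotypes(masks, [1, 2, 4, 6, 7]))  # 0: unobserved recombinant
```

This assumes an acyclic graph in which each haplotype visits a node at most once, so a nonzero result means the walk is a contiguous piece of an input haplotype. Edges missing from the table are the ones that no input haplotype crosses, which are the edges that removing unobserved edges would delete.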

