
graph can imply unobserved sequences (vg issue, closed)

vgteam avatar vgteam commented on July 19, 2024
graph can imply unobserved sequences

from vg.

Comments (2)

ekg avatar ekg commented on July 19, 2024

@davebiffuk is referring to the phenomenon that occurs when phasing information about the original sequences is removed and the graph is constructed using only the edit information implied by the VCF, ignoring any haplotypes that are given as input. (As of the time of this comment, this is the default mode of VCF-based construction in vg.)

This is a really interesting issue. It presents some problems but may not be completely avoidable in all contexts. I'll try to explain when it does and doesn't make sense to allow the graph to imply unobserved sequences.

Suppose this graph had been generated from two original input sequences, one red and one black:

[Figure: variation graph]

We can now find many sequences in it that may never have been observed, for instance [1,2,5,6,7]. Enumerating every path through this graph gives 48 in total, far more than the two input sequences. You can see this by running this code in the test directory of this repo:

# first we make a sub-graph by taking the head of the GFA format version of the graph
vg construct -r small/x.fa -v small/x.vcf.gz \
    | vg view - | head -28 | vg view -v - >y.vg
# then we make all paths of ~length 40, which includes all the paths in this graph
vg paths -s -l 40 y.vg | wc -l # 48
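
The same counting idea can be sketched without vg. The snippet below is a minimal illustration on a hypothetical seven-node toy graph (it is not the graph from the figure, and the node ids, `EDGES`, and `OBSERVED` are invented for the example). It enumerates every walk from the first node to the last and reports which ones match neither input haplotype.

```python
# Hypothetical toy graph (NOT the graph in the figure): node id -> successor ids.
EDGES = {1: [2, 3], 2: [4, 5], 3: [4, 5], 4: [6], 5: [6], 6: [7], 7: []}

# Assumed input haplotypes, standing in for the "red" and "black" sequences.
OBSERVED = [[1, 2, 4, 6, 7], [1, 3, 5, 6, 7]]


def all_paths(edges, start, end):
    """Enumerate every walk from start to end in a DAG by depth-first search."""
    if start == end:
        return [[end]]
    return [[start] + rest
            for nxt in edges[start]
            for rest in all_paths(edges, nxt, end)]


paths = all_paths(EDGES, 1, 7)
unobserved = [p for p in paths if p not in OBSERVED]

print(len(paths), "paths in total,", len(unobserved), "never observed")
for p in unobserved:
    print(p)
```

In this toy graph, [1,2,5,6,7] is one of the walks that the graph implies even though neither input haplotype contains it.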

This seems problematic, but there are a few things to keep in mind.

  1. Homologous regions can support recombination. Although recombination is rare (on the order of de novo variation), it does happen and it can happen anywhere. Many species do not have specific recombination loci. These paths could exist in the event of a specific set of recombinations between the original red and black sequences.
  2. This succinct representation is large in terms of haplotype space but small in terms of sequence space. This enables sensitive and efficient pairwise local alignment algorithms to run natively on the variation graph.
  3. Allowing a graph to encode sequences that haven't been observed can be expedient. For example, you may not know the actual genomes that were observed, and only have information about variants and frequencies. This is actually a rather common situation, particularly when the identities of the individuals who contributed to the list of variants are private or not shareable. A variant list is easy to exchange and rather lightweight relative to a full set of haplotypes.
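
To make point 2 concrete, here is a rough sketch that assumes the simplest possible structure: a chain of independent biallelic sites, each separated by a shared node. The helpers `bubble_chain` and `count_paths` are hypothetical names for this illustration. The number of nodes grows linearly with the number of sites, while the number of paths (the haplotype space) doubles with each site.

```python
def bubble_chain(n_sites):
    """Build a chain of n biallelic bubbles: shared -> (ref | alt) -> shared -> ..."""
    edges = {}
    node = 0  # id of the first shared node
    for _ in range(n_sites):
        ref, alt, nxt = node + 1, node + 2, node + 3
        edges[node] = [ref, alt]
        edges[ref] = [nxt]
        edges[alt] = [nxt]
        node = nxt
    edges[node] = []  # final shared node has no successors
    return edges


def count_paths(edges, source, sink):
    """Count source-to-sink paths by dynamic programming.

    Relies on node ids being topologically ordered, which holds for bubble_chain.
    """
    counts = {sink: 1}
    for node in sorted(edges, reverse=True):
        if node != sink:
            counts[node] = sum(counts[s] for s in edges[node])
    return counts[source]


for n in (1, 5, 10, 20):
    graph = bubble_chain(n)
    print(n, "sites:", len(graph), "nodes,", count_paths(graph, 0, 3 * n), "paths")
```

With 20 sites the sketch has 61 nodes but over a million paths, which is the sense in which the graph is compact in sequence space and large in haplotype space.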

Problems do occur. In particular, if one naïvely samples k-mers of a given length from a graph that allows many recombinations between closely spaced variants, certain regions will generate huge numbers of k-mers. This limits our ability to map to those regions and, in the extreme, even our ability to build the k-mer index of the graph (done via vg index -k N x.vg).
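
The effect can be illustrated with a small sketch that does not use vg at all. It assumes a simplified model in which every site is a single base with an optional alternate allele, and compares the k-mers implied by free recombination with the k-mers actually present in two haplotypes. The reference string, the `ALT` table, and all function names are invented for the example.

```python
from itertools import product


def make_sites(ref, snps):
    """Turn a reference string and {position: alt_base} into per-site allele strings."""
    return [ref[i] + snps.get(i, "") for i in range(len(ref))]


def graph_kmers(sites, k):
    """All k-mers spelled by any combination of alleles in k consecutive sites."""
    kmers = set()
    for i in range(len(sites) - k + 1):
        for combo in product(*sites[i:i + k]):
            kmers.add("".join(combo))
    return kmers


def haplotype_kmers(haplotypes, k):
    """All k-mers that occur in the given haplotype strings."""
    return {h[i:i + k] for h in haplotypes for i in range(len(h) - k + 1)}


REF = "ACGTTGCAAGCTTCGA"                         # hypothetical 16-base reference
ALT = {"A": "C", "C": "G", "G": "T", "T": "A"}  # arbitrary alternate alleles
K = 8

sparse = make_sites(REF, {7: ALT[REF[7]]})                      # one SNP
dense = make_sites(REF, {i: ALT[REF[i]] for i in range(4, 12)})  # eight adjacent SNPs
alt_hap = REF[:4] + "".join(ALT[b] for b in REF[4:12]) + REF[12:]

sparse_kmers = graph_kmers(sparse, K)
dense_kmers = graph_kmers(dense, K)
observed_kmers = haplotype_kmers([REF, alt_hap], K)

print("one SNP:", len(sparse_kmers), "k-mers")
print("eight adjacent SNPs:", len(dense_kmers), "k-mers")
print("two observed haplotypes:", len(observed_kmers), "k-mers")
```

The densely variable region yields hundreds of k-mers from only 16 bases, while the two haplotypes that could have produced those variants contain at most 18.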

This issue can be mitigated in several ways.

  1. We can limit the number of edges that may be crossed when a k-mer is generated. To do this, specify -e to vg paths, vg kmers, or vg index, as in: vg paths -s -k 21 -e 9 x.vg
  2. We can construct the graph from a VCF in which short haplotypes are combined into single variants (that is, multiallelic variants with long alleles relative to the reference). Note that this isn't yet supported, but would require only a few minor changes to begin testing.
  3. We could remove any edge that doesn't lie in a path defined by one of the input haplotypes. This also isn't supported but would be straightforward to do, and is probably better than (2).
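
Approaches (1) and (3) can be sketched on a toy graph. This is an illustration under stated assumptions, not vg's implementation: the graph, the haplotypes, and the functions `kmers` and `prune_to_haplotypes` are hypothetical, and the `max_edges` parameter simply counts every edge a walk crosses, which may differ from how vg's -e option accounts for edges.

```python
def kmers(seqs, edges, k, max_edges=None):
    """k-mers spelled by walks; walks crossing more than max_edges edges are dropped."""
    found = set()

    def extend(node, offset, prefix, crossed):
        prefix += seqs[node][offset:offset + k - len(prefix)]
        if len(prefix) == k:
            found.add(prefix)
            return
        if max_edges is not None and crossed == max_edges:
            return  # edge budget exhausted before the k-mer was complete
        for nxt in edges[node]:
            extend(nxt, 0, prefix, crossed + 1)

    for node, seq in seqs.items():
        for offset in range(len(seq)):
            extend(node, offset, "", 0)
    return found


def prune_to_haplotypes(edges, haplotypes):
    """Keep only edges that lie on at least one haplotype path."""
    used = {(a, b) for h in haplotypes for a, b in zip(h, h[1:])}
    return {n: [s for s in succ if (n, s) in used] for n, succ in edges.items()}


# Hypothetical graph: three adjacent SNPs between two longer shared nodes.
SEQS = {0: "ACGT", 1: "A", 2: "C", 3: "G", 4: "T", 5: "A", 6: "C", 7: "GATTACA"}
EDGES = {0: [1, 2], 1: [3, 4], 2: [3, 4], 3: [5, 6], 4: [5, 6], 5: [7], 6: [7], 7: []}
HAPLOTYPES = [[0, 1, 3, 5, 7], [0, 2, 4, 6, 7]]  # assumed input haplotypes
K = 8

full = kmers(SEQS, EDGES, K)
limited = kmers(SEQS, EDGES, K, max_edges=2)
pruned = kmers(SEQS, prune_to_haplotypes(EDGES, HAPLOTYPES), K)

print("no limit:", len(full))
print("at most 2 edges:", len(limited))
print("haplotype edges only:", len(pruned))
```

Both mitigations return a subset of the unrestricted k-mer set. The edge limit is blunt, since it also drops k-mers that real haplotypes contain, whereas pruning keeps exactly those k-mers whenever adjacent variants are joined by direct edges, as they are in this toy graph.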

I haven't pursued the latter two approaches because I think it should be possible for people to use vg to build graphs when they only have variant lists. This is sometimes harder than building and mapping against graphs that only contain observed haplotypes as paths. Experience with real data will likely suggest the best and most general approach.


ekg avatar ekg commented on July 19, 2024

I'm going to mark this as closed. The way to resolve this in the future is to store the paths of haplotypes, perhaps as compressed bitvectors, or perhaps in pBWT, perhaps both.
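
As a very rough sketch of what storing haplotype paths could enable, the snippet below keeps one uncompressed bitmask per edge, with bit i set when haplotype i crosses that edge. It is neither a compressed bitvector nor a pBWT, and the function names and toy haplotypes are assumptions made for illustration. A walk is then supported by exactly those haplotypes whose bit survives an AND across all of its edges.

```python
from collections import defaultdict


def index_haplotypes(haplotypes):
    """Map each edge to an integer bitmask; bit i is set when haplotype i uses the edge."""
    masks = defaultdict(int)
    for i, hap in enumerate(haplotypes):
        for edge in zip(hap, hap[1:]):
            masks[edge] |= 1 << i
    return masks


def supporting_haplotypes(masks, n_haplotypes, walk):
    """Indices of haplotypes that traverse every edge of the walk.

    A walk with no edges (a single node) is treated as supported by all haplotypes.
    """
    support = (1 << n_haplotypes) - 1  # start with every haplotype bit set
    for edge in zip(walk, walk[1:]):
        support &= masks.get(edge, 0)
    return [i for i in range(n_haplotypes) if (support >> i) & 1]


# Hypothetical haplotype paths through a toy graph.
HAPLOTYPES = [[1, 2, 4, 6, 7], [1, 3, 5, 6, 7]]
masks = index_haplotypes(HAPLOTYPES)

print(supporting_haplotypes(masks, 2, [1, 2, 4]))        # observed in haplotype 0
print(supporting_haplotypes(masks, 2, [6, 7]))           # shared by both haplotypes
print(supporting_haplotypes(masks, 2, [1, 2, 5, 6, 7]))  # recombinant, no support
```

With such an index, the graph can keep every edge implied by the variants while still letting a k-mer generator or aligner ask whether a particular walk was ever observed.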

