Giter Site home page Giter Site logo

glia's Introduction

glia

a Graph/Smith-Waterman (partial order) aligner/realigner

Erik Garrison [email protected]

Deniz Kural [email protected]

Installation Instructions

Make sure you download the submodules by adding the recursive flag to the clone command:

git clone --recursive https://github.com/ekg/glia.git

Then build with a call to make:

cd glia
make

You'll need cmake to compile bamtools, which is a submoduled dependency. ZLIB is also required, but is typically installed system-wide on development servers.

Example usage

glia consumes a VCF, a target region, and a string to generate an alignment of any string against the variant graph defined by the VCF. For instance calling:

glia -f ~/human_g1k_v37.fasta -t 20:62900077-62902077 -v variants.vcf.gz \
     -s AAATGTAAACATTTTATAGGGGATTCCCCTAAAAACAAAAAAACTTTCTGGGAAAGATTTTTCAAAAAATAAAA

Will report the alignment of this string against the variant graph in the region 20:62900077-62902077. This mode is provided for debugging purposes.

Streaming local realignment

glia's main use is as a local realigner. It will realign reads to a set of known (or putative) variants in a VCF, both consuming and producing an ordered stream of BAM alignments. As such, it can be used immediately prior to variant calling with standard methods. In this case, we use freebayes:

<sample.bam glia -Rr -w 1500 -S 200 -Q 200 -G 4 -f ref.fasta -v known_variants.vcf.gz \
    | freebayes -f ref.fasta --stdin >sample.vcf

Here, glia is running reads that have a soft-clip ('S' cigar element) quality sum of >200 (-S 200), or a mismatch quality sum >200 (-Q 200) or any indel (default), and only accepting alignments that have fewer than 4 embedded gaps and decrease the overall quality score sum of soft-clips and mismatches. So as to match unaligned mates to their correct alignments, this invocation of glia will align reads to the variant graph constructed across a window of 1500bp semi-centered around the read alignment position (-w 1500). These settings are typically sensible for realigning Illumina 70bp reads as found in the 1000 Genomes Project.

Why?

The alignment of reads can be frustrated by variants, both large (e.g. structural variants) and small (e.g. SNPs, small indels). Observations of larger variants, where the variant scale is of a large fraction of a read length or greater, can be rescued by using unaligned mates from paired sequencing. For smaller variants, more typically contemporary aligners will soft-clip starts and ends of a read when the read contains multiple mismatches or gaps. Soft-clipping makes sense in that the tails of reads often contain many sequencing errors, but this filtering practice can reduce sensitivity of resequencing even to small variants.

glia realigns poorly-aligned reads against a local variant graph in order to resolve these biases, provided we know about, or have a good hint that a variant potentially exists in our dataset. Suitable input could be either a known variant set such as dbSNP or a union set of variants such as might be produced by an ensemble variant detection process).

How?

glia uses an independently-developed (by Deniz Kural) version of partial order alignment algorithm to realign reads to a local "variant graph," "sequence graph," or "partial order." This structure was originally applied to the problem of multiple sequence alignment due to its favorable time and space complexity, which is approximately O(NM) where N is the length of the query and M is the total length of the partial-order graph against which we are aligning. In glia, we apply this method to read alignment and variant detection in the context of the sequence reads generated by 2nd and 3rd-generation sequencing platforms.

Considering a multiple sequence alignment, we can construct a partially-ordered graph in which all of the input sequences are paths:

partial order graph construction

(Note that these figures are drawn from Christopher Lee's original paper on partial order alignment)

This is the structure which glia aligns reads against. Provided a VCF file encodes the same structure and does not include internal conflicts, we can use it for the generation of partially-ordered, directed acyclic graphs.

In partial order (or graph-Smith-Waterman) alignment, the standard dynamic-programming alignment algorithm is generalized to allow alignment across edges of this graph. This can be understood as a modification of the recurrence relation used to define the score distribution across the alignment matrix to account for the score on the upstream side of inbound links. When we initialize a new matrix, we take the score and identity of the node with the maximum alignment score in the same position in the previous node matrix.

partial order alignment

This then allows us to trace back the alignment across the edges of the graph. In practice, glia uses this information to determine if the graph can provide a better alignment of a given read. When the new alignment has better metrics in terms of quality/mismatch or soft-clip, then it is accepted, "flattened" from the graph back into the reference space, and emitted as a BAM record for downstream processing. If the new alignment is not an improvement, or if there are other problems, then the original alignment is provided as-is.

Help!!!?

Email [email protected] with any questions or concerns. Please report bugs via the bugtracking system in github.

glia's People

Contributors

denizkural avatar ekg avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar

glia's Issues

git clone fails due to broken link

It looks like the rapidjson github repository has moved (potentially to here), and as a result, git clone --recursive no longer works on this repository.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.