Giter Site home page Giter Site logo

gbwt's Introduction

GBWT

Graph BWT is an independent implementation of the graph extension (gPBWT) of the positional Burrows-Wheeler transform (PBWT). Its initial purpose is to embed observed haplotypes in a variation graph. Haplotypes are essentially sequences of nodes in the variation graph, and GBWT is best seen as the multi-string BWT of the node sequences.

See the wiki for further documentation.

Overview

GBWT is a multi-string FM-index for indexing large collections of similar paths over a graph. The paths are interpreted as sequences of node identifiers, and the sequences are stored in the FM-index. The index is stored in a number of records corresponding to the nodes of the graph. Hence the GBWT can often take advantage memory locality when the queries traverse adjacent nodes.

There are three variants of the GBWT:

  • Compressed GBWT is space-efficient and reasonably fast.
  • Dynamic GBWT is a faster and larger variant intended for index construction.
  • Cached GBWT is a cache layer on top of the compressed GBWT. It improves the performance when we repeatedly traverse se same subset of nodes.

The GBWT uses three main construction algorithms:

  • An algorithm inspired by RopeBWT2 that inserts a batch of new paths into a dynamic GBWT.
  • A variant of BWT-merge for merging multiple indexes.
  • An algorithm for quickly merging multiple indexes when the node identifiers do not overlap.

Compiling GBWT

The implementation is based on Succinct Data Structure Library 2.0 (SDSL). As the implementation uses C++11, OpenMP, and libstdc++ parallel mode, you need g++ 4.9 or newer to compile. On Apple systems, GBWT can also be built with Apple Clang 9.1, but libomp must be installed via Macports or Homebrew, and the lack of libstdc++'s parallel mode extensions will result in slower index construction.

Before compiling, set SDSL_DIR in the Makefile to point to your SDSL directory (the default is ../sdsl-lite). To compile, simply run make. Use install.sh to compile GBWT and install the headers and library to your home directory, or install.sh prefix to specify another install prefix.

GBWT is compiled with -DNDEBUG by default. Using this option is highly recommended. There are several cases, where SDSL code works correctly but the assertions are incorrect. As SDSL 2.0 is no longer actively supported, we have to wait until the release of SDSL 3.0 to fix these issues.

Citing GBWT

Jouni Sirén, Erik Garrison, Adam M. Novak, Benedict Paten, and Richard Durbin: Haplotype-aware graph indexes. Bioinformatics 36(2):400-407, 2020. DOI: 10.1093/bioinformatics/btz575

Other references

Richard Durbin: Efficient haplotype matching and storage using the Positional Burrows-Wheeler Transform (PBWT). Bioinformatics 30(9):1266-1272, 2014. DOI: 10.1093/bioinformatics/btu014

Adam M. Novak, Erik Garrison, and Benedict Paten: A graph extension of the positional Burrows-Wheeler transform and its applications. Algorithms for Molecular Biology 12:18, 2017. DOI: 10.1186/s13015-017-0109-9

gbwt's People

Contributors

adamnovak avatar ekg avatar jltsiren avatar mr-c avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.