popVCF

popVCF losslessly encodes a multi-sample VCF to reduce its disk footprint. A VCF field is encoded by pointing to an exactly identical field in the same row or in the row above. The gain is small on a single-sample VCF, but on large population VCFs the compression ratio can exceed 40x, about 5x better than standard bgzip compression. The compression ratio varies considerably between data sets; see below for benchmarks on several of them.
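To make the idea of "pointing to identical fields" concrete, here is a minimal toy sketch in shell and awk. It is not the popVCF format or implementation: the function names toy_encode and toy_decode and the markers "$" (same as the field directly above) and "%N" (same as column N of the same row) are hypothetical, and the sketch assumes no real field equals "$" or starts with "%".

```shell
#!/bin/sh
# Toy illustration only: NOT the popVCF on-disk format.
# Hypothetical markers: "$" = same as the field above, "%N" = same as column N in this row.

toy_encode() {
    awk '
    BEGIN { FS = OFS = "\t" }
    /^#/ || NF == 0 { print; next }      # header and blank lines pass through unchanged
    {
        split("", seen)                  # value -> first column holding it in this row
        line = ""
        for (i = 1; i <= NF; i++) {
            f = $i ""                    # force string comparison, so "1.0" differs from "1"
            if (i <= prev_n && f == prev[i]) out = "$"
            else if (f in seen)              out = "%" seen[f]
            else                             out = f
            if (!(f in seen)) seen[f] = i
            line = (i == 1 ? out : line OFS out)
        }
        print line
        for (i = 1; i <= NF; i++) prev[i] = $i ""
        prev_n = NF
    }' "$@"
}

toy_decode() {
    awk '
    BEGIN { FS = OFS = "\t" }
    /^#/ || NF == 0 { print; next }
    {
        for (i = 1; i <= NF; i++) {
            f = $i ""
            if (f == "$")                    cur[i] = prev[i]                  # copy from row above
            else if (substr(f, 1, 1) == "%") cur[i] = cur[substr(f, 2) + 0]    # copy from same row
            else                             cur[i] = f                        # literal value
        }
        line = cur[1]
        for (i = 2; i <= NF; i++) line = line OFS cur[i]
        print line
        for (i = 1; i <= NF; i++) prev[i] = cur[i]
    }' "$@"
}
```

In a population VCF many samples share the same genotype field at a site, and neighbouring sites often repeat it, so most fields collapse to a short marker. Running "toy_encode < my.vcf | toy_decode" should reproduce the input byte for byte under the stated assumptions.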

Files are encoded with the "popvcf encode" command; adding the "-Oz" flag writes the output directly in bgzip format. The "popvcf decode" command decodes the file back to VCF. On a bgzipped, tabix-indexed file, the decode subcommand can also query a region using the option "--region=chrN:A-B".

On 64-bit Linux, you can get the latest static binary from the Release page.

Benchmarks

We benchmarked popVCF against a few other compression methods on large population VCF data. In all experiments, we used a single CPU thread and report wall-clock time measured with /usr/bin/time. The VCF data was read from and written to an SSD. spVCF was run with the "--no-squeeze" option to prevent any lossy compression. The benchmark script is in the benchmark/ directory. We had to exclude Genozip and VCFShark from the WGS benchmarks because repeated runtime errors prevented them from compressing the data.

Benchmarked versions: popVCF v1.1.0, spVCF v1.2.0-0-gbecb461, htslib+bcftools v1.14 (with libdeflate), Genozip 13.0.11, VCFShark v1.1.

GraphTyper UK biobank WGS-487k individual data

Method/format  Compression ratio  Compared to bgzip
popVCF+bgzip   37.6x              4.4x
spVCF+bgzip    17.2x              2.0x
BCF            10.5x              1.2x
bgzip (VCF)    8.6x               1.0x

Deep Variant/GLnexus WES-200k individual data

Method/format  Compression ratio  Compared to bgzip  Compression speed (MB/s)  Decompression speed (MB/s)
popVCF+bgzip   102.9x             6.9x               194.0                     490.7
spVCF+bgzip    43.8x              2.9x               129.7                     281.5
Genozip        35.0x              2.3x               18.0                      17.3
VCFShark       28.3x              1.9x               22.8                      21.7
BCF            14.0x              0.94x              62.4                      175.2
bgzip (VCF)    14.9x              1.0x               91.6                      521.3

GATK UK biobank WGS-150k individual data

Method/format  Compression ratio  Compared to bgzip  Compression speed (MB/s)  Decompression speed (MB/s)
popVCF+bgzip   20.1x              2.8x               102.2                     295.0
spVCF+bgzip    10.0x              1.4x               58.8                      165.7
BCF            6.7x               0.94x              55.6                      174.2
bgzip (VCF)    7.1x               1.0x               58.5                      474.7
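
The "Compared to bgzip" column in the tables above appears to be each method's compression ratio divided by the bgzip ratio for the same data set (for example 37.6 / 8.6 is about 4.4). As a rough sketch of how such figures could be computed for your own files, the helpers below derive a compression ratio from two file sizes and express one ratio relative to a baseline. The function names are hypothetical and this is not the script from the benchmark/ directory.

```shell
#!/bin/sh
# Hypothetical helpers, not part of popVCF or its benchmark script.

# compression_ratio ORIGINAL COMPRESSED
# Prints (size of ORIGINAL in bytes) / (size of COMPRESSED in bytes), one decimal.
compression_ratio() {
    orig_bytes=$(wc -c < "$1")
    comp_bytes=$(wc -c < "$2")
    awk -v o="$orig_bytes" -v c="$comp_bytes" 'BEGIN { printf "%.1f\n", o / c }'
}

# relative_to_baseline RATIO BASELINE_RATIO
# Prints RATIO / BASELINE_RATIO with one decimal, e.g. a method's ratio versus bgzip's.
relative_to_baseline() {
    awk -v r="$1" -v b="$2" 'BEGIN { printf "%.1f\n", r / b }'
}
```

For instance, "compression_ratio my.vcf my.popvcf.gz" gives the ratio for a popVCF+bgzip file, and passing that value together with the ratio of a plain bgzip file to relative_to_baseline gives the relative gain.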

Usage

popvcf encode my.vcf > my.popvcf
popvcf decode my.popvcf > my.new.vcf
diff my.vcf my.new.vcf # Should be the same

# It is also possible to bgzip, tabix index and query
popvcf encode my.vcf -Oz > my.popvcf.gz
tabix my.popvcf.gz
popvcf decode my.popvcf.gz > my.new2.vcf
popvcf decode my.popvcf.gz --region=chrN:A-B > my.region.vcf # Random access a region using the tabix index

Building

A feature-complete C++17 compiler is required to build popVCF, i.e. GCC 8/Clang 10 or newer.

git clone --recursive <url> popvcf # Clone the repository
cd popvcf
mkdir build-release
cd build-release
cmake ..
make -j3 popvcf

License

MIT

Contributors

hannespetur
