
Compression benchmark


Benchmarking fastq compression with generic (mature) compression algorithms

Motivation

This benchmark is motivated by a recent question from Ryan Connor on the µbioinfo Slack group:

> my impression is that bioinformatics really likes gzip (and only gzip?), but that there are other generic compression algs that are better (for bioinfo data types); assuming you agree (if not, why not?), why haven't the other compression types caught on in bioinformatics?

It kicked off an interesting discussion, which led me to dig into the literature and see what I could find. I'm sure I could have searched deeper and for longer, but I really couldn't find any benchmarks that satisfied me. Don't get me wrong, there are plenty of benchmarks, but they're always looking at bioinformatics-specific tools for compressing sequencing data. Sure, these perform well, but every repository I went to hadn't been touched in a while. When archiving data, the last thing I want is to try and decompress it only to find the tool no longer installs or works on my system. In addition, I want the tool to be ubiquitous and mature. I know that's a lot of constraints, but hey, that's what I'm interested in.

This benchmark therefore only covers ubiquitous/mature/generic compression tools.

Methods

Tools

The tools tested in this benchmark are:

  • gzip
  • bzip2
  • xz
  • zstd

These tools were used as they were the main ones that came up in our discussion. Feel free to raise an issue on this repository if you would like to see another tool included.

All compression level settings were tested for each tool and default settings were used for all other options.
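
As an illustration, the per-tool runs look something like the following (a sketch only; the file name is hypothetical and the exact commands live in this repository's workflow). The level ranges in the comments are each tool's documented ones:

```sh
# -k keeps the input file so the original size can still be measured
gzip  -k -6 sample.fastq   # levels 1-9,  default 6  -> sample.fastq.gz
bzip2 -k -9 sample.fastq   # levels 1-9,  default 9  -> sample.fastq.bz2
xz    -k -6 sample.fastq   # levels 0-9,  default 6  -> sample.fastq.xz
zstd     -3 sample.fastq   # levels 1-19, default 3  -> sample.fastq.zst (input kept by default)
```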

Data

The data used to test each tool are fastqs:

  • Nanopore
  • Illumina

Note: I couldn't find sources for all of these samples. If you can fill in some of the gaps, please raise an issue and I will gladly update the sources.

All data were downloaded with fastq-dl (v2.0.1). Paired Illumina data were combined into a single fastq file.
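
For example (accession hypothetical, and assuming fastq-dl v2's `--accession` option):

```sh
# fastq-dl fetches the run from ENA/SRA as gzipped fastq(s)
fastq-dl --accession SRR1234567
# paired Illumina runs were combined into one file; concatenated gzip
# members still form a valid gzip file
cat SRR1234567_1.fastq.gz SRR1234567_2.fastq.gz > SRR1234567.fastq.gz
```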

Results

Compression ratio

The first question is how much smaller each compression tool makes a fastq file. As this also depends on the compression level selected, all possible levels were tested for each tool (the default is indicated with a red circle).

The compression ratio here is the compressed size as a percentage of the original file size - i.e., $\frac{\text{compressed size}}{\text{uncompressed size}} \times 100$.
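
For a single file, that works out to something like this (GNU stat assumed; file names hypothetical):

```sh
orig=$(stat -c%s sample.fastq)     # uncompressed size in bytes
comp=$(stat -c%s sample.fastq.gz)  # compressed size in bytes
echo "scale=2; 100 * $comp / $orig" | bc  # ratio as % of the original size
```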


Compression ratio figure

Figure 1: Compression ratio (y-axis) for different compression tools and levels. Compression ratio is a percentage of the original file size. The red circles indicate the default compression level for each tool. Illumina data is represented with a solid line and circular points, whereas Nanopore data is a dashed line with cross points. Translucent error bands represent the 95% confidence interval.


The most striking result here is the noticeable difference in compression ratio between Illumina and Nanopore data, regardless of the compression tool used. (If anyone can suggest a reason for this, please raise an issue.)

Update 07/06/2023: Peter Menzel mentioned this is likely due to the noisier quality scores in the Nanopore data. Illumina quality scores are generally quite homogeneous, which increases compressibility.

Using default settings, zstd and gzip provide similar ratios, as do xz and bzip2 (though compression level doesn't seem to change the ratio for bzip2 at all). At the highest compression level, xz provides the best compression (however, this comes at a cost in runtime, as we'll see below).

(De)compression rate and memory usage

In many scenarios, the (de)compression rate is just as important as the compression ratio. However, if compressing for archival purposes, rate is probably not as important.

The (de)compression rate is $\frac{\text{uncompressed size}}{\text{(de)compression time (secs)}}$.
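
As a sketch of how rate and peak memory can be measured (GNU time assumed; the actual benchmark harness may differ):

```sh
# GNU time (not the shell builtin) reports wall-clock time and peak RSS
/usr/bin/time -v gzip -k -6 sample.fastq 2> gzip.time.log
grep -E 'Elapsed \(wall clock\)|Maximum resident' gzip.time.log
# rate = uncompressed size in bytes / elapsed seconds
```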


Compression rate figure

Figure 2: Compression (left column) and decompression (right column) rate (top row) and peak memory usage (lower row). Note the log scale for rate. The red circles indicate the default compression level for each tool. Illumina data is represented with a solid line and circular points, whereas Nanopore data is a dashed line with cross points. Translucent error bands represent the 95% confidence interval.


As alluded to earlier, xz pays for its fantastic compression ratios by being orders of magnitude slower than the other tools. It also uses a lot more memory than the other tools when (de)compressing, although in absolute terms the highest memory usage is still well below 1 GB.

The main takeaway from Figure 2 is that zstd (de)compresses much faster than the other tools (using the default level). Compression level has a big impact on compression rate (except for bzip2), but much less impact on decompression rate.

Conclusion

So which tool should you use? As is so often the case with benchmarks: it depends on your situation.

If all you care about is compressing your data as small as it will go, and you don't mind how long it takes, then xz, with compression level 9, is the obvious choice.

If you want fast (de)compression, then zstd (default options) is the best choice.

If, like most people, you're contemplating replacing gzip (default options), the situation is a little less clear. As a drop-in replacement, zstd (default options) will give you about the same compression ratio with ~10-fold faster compression and ~3-5-fold faster decompression. Another option is bzip2, which will give you ~1.2-fold better compression ratios than gzip (and zstd) with a comparable compression rate to gzip. However, bzip2's decompression rate is ~5-fold slower than gzip's.
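
In gzip-replacement terms, the round trips map onto each other almost one-to-one (file name hypothetical; note that zstd keeps its input unless told otherwise):

```sh
gzip      sample.fastq && gzip -d       sample.fastq.gz   # classic gzip round trip
zstd --rm sample.fastq && zstd -d --rm  sample.fastq.zst  # zstd equivalent; --rm deletes the input, matching gzip's behaviour
```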

One final consideration is the availability of APIs for various programming languages. If it is difficult to read/write files compressed with a given algorithm, then using that compression type might cause problems. Most (good) bioinformatics tools support gzip-compressed input and output; however, supporting the other compression types shouldn't be too much work for most software tool developers, provided a well-maintained and documented API is available in the relevant programming language (and even without one, a decompression stream can be piped in, as sketched after the table below). Here is a list of APIs for the tested compression tools in a selection of programming languages, with an arbitrary grading system for how "stable" I think they are (feel free to put in a pull request if you want to contribute other languages).

|        | gzip | bzip2 | xz | zstd |
|--------|------|-------|----|------|
| Python | A    | A     | A  | B+   |
| Rust   | A    | B+    | B+ | B    |
| C/C++  | A    | A     | A  | A    |
  • A: standard library (i.e. builtin) or library is maintained by the original developer (note: Rust's gzip library is maintained by rust-lang itself)
  • B: external library that is actively maintained, well-documented, and has quick response times
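
And even where library support is weaker, every tool here can decompress to stdout, so a pipe is often all a downstream tool needs (a minimal sketch; file names hypothetical):

```sh
# -dc decompresses to stdout; any tool reading fastq on stdin can consume it
xz   -dc sample.fastq.xz  | head -n 4
zstd -dc sample.fastq.zst | wc -l
```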
