Giter Site home page Giter Site logo

Comments (9)

adamkewley avatar adamkewley commented on September 7, 2024

Just so I understand, this optimization is good for DNA files because having a smaller match length causes individual literals to be encoded using a more efficient Huffman encoding (because the resulting tree will only have LZ matches > ~8 in length in the lcode tree), and DNA reads are mostly random-looking data?

from libdeflate.

chrchang avatar chrchang commented on September 7, 2024

This issue concerns Illumina-sequencing read identifiers, not the actual DNA data in the read.

from libdeflate.

adamkewley avatar adamkewley commented on September 7, 2024

Sorry, I was confused by the formatting a little. So, in this particular case you see an improvement when setting the min match length because it will LZ the head of the ID (repeats heavily) then Huffman the tail, interesting.

from libdeflate.

ebiggers avatar ebiggers commented on September 7, 2024

Do you have any sample files available I could test with? I'd like to investigate whether this case could be handled better automatically using some heuristics. Adding more options increases API complexity so is generally something I'd like to avoid.

from libdeflate.

jkbonfield avatar jkbonfield commented on September 7, 2024

Many thanks for the quick reply. A heuristic if possible would be better still. I was just thinking about what would be simple (and offering to do it if it's something you'd want). I don't understand it well enough to work out where to put heuristics though.

Take a look at the data in https://github.com/jkbonfield/htscodecs-corpus/tree/master/names

A quick run through all the *.names files with and without a +4 hack as above:

Original

$ for i in tests/htscodecs-corpus/names/*.names;do ~/ftp/compression/libdeflate/gzip < $i 2>/dev/null|wc -c;done
72201
79313
101113
55364
100884
53751
75087
31300
59432
19211
177192

With change:

$ for i in tests/htscodecs-corpus/names/*.names;do ~/ftp/compression/libdeflate/gzip < $i 2>/dev/null|wc -c;done
66852
72401
92062
50348
93563
54053
68551
32222
54043
17307
161436

It generally helps a lot, but not universally.
On this data I've wondered whether a custom algorithm that creates deflate format using purely line-oriented prefix matching would give the best speed for size tradeoff. (However for slower modes of my work I have a dedicated (non-deflate) name compressor anyway, so the impetus to be clever isn't huge.)

from libdeflate.

jkbonfield avatar jkbonfield commented on September 7, 2024

Sorry, I was confused by the formatting a little. So, in this particular case you see an improvement when setting the min match length because it will LZ the head of the ID (repeats heavily) then Huffman the tail, interesting.

I haven't analysed quite what it's doing, but my thinking was the form of these strings is basically common prefix (instrument name) then :lane:tile:x:y. There is some dictionary component to tile, but the x and y are largely random. There may be lots of X and Y strings and substrings just because with only 10 characters the digits have a good chance of random matching (1 in 100 if we include colon + 2 digits).

It's quite easy for parsers to generate an LZ match for "Y newline prefix:lane" instead of just "prefix:lane", possibly at the expense of making a longer distance when it's not really justified for random chance match. (I assume that's what the optimal parser is trying to judge though.) Changing the min match length doesn't stop that so it's not what my change achieved clearly. I did try hacking it to require a newline starting character but that turned out to be tragic, so canned that idea quickly.

from libdeflate.

jkbonfield avatar jkbonfield commented on September 7, 2024

Another test, this time with DNA sequence data, also shows Zlib's Z_DEFAULT_STRATEGY vs Z_FILTERED to be a win. I'll put the data file in ftp://ftp.sanger.ac.uk/pub/users/jkb/seq

      Z_DEFAULT  Z_FILTERED   Default+Opt    Default-Opt  MIN_10 -Opt   7za
1     11185034   11185034     8807052        8807052      6767390
2     10172931   10172931     7672585        7672585      6329695
3      9300280    9300280     7453628        7453628      6208820
4      7606177    6803282     7310028        7310028      6102399
5      6904720    6537932     6765513        6765513      6027264
6      6658313    6447436     6600476        6600476      5897747
7      6592924    6440727     6494304        6494304      5831461
8      6448238    6325678     6514868        6434674      5790594
9      6427283    6303307     6217224        6385173      5751458       5722540
10     		  	      6086605
11     			      5977583
12     			      5909323

That shows Zlib with Z_DEFAULT_STRATEGY vs Z_FILTERED, Libdeflate as cloned (optimal parsing enabled for level 8-12), libdeflate with SUPPORT_NEAR_OPTIMAL_PARSING disabled (levels 8,9 only), libdeflate with SUPPORT_NEAR_OPTIMAL_PARSING disabled and +7 on the existing min_match len of 3, and finally max compression of 7zip in gzip mode.

You can see by default libdeflate beats zlib's defaults. Zlib gains a bit using Z_FILTERED, but it's still beaten when we use higher levels of libdeflate. However libdefault with a longer minimum match length trounces everything bar 7za and it gets pretty close to matching that too.

For what it's worth, I see the same benefits in zlib. Z_FILTERED is changing min match from 3 to 5. Bumping it to 10 manually shows level 6 gives 5935354 and level 9 5773148, so again we see potential for manual tweakage there too.

PS. I don't know how to limit the minimum match length for the optimal parser, so no numbers have been presented there.

from libdeflate.

ebiggers avatar ebiggers commented on September 7, 2024

I implemented some heuristics which handle setting the minimum match length automatically. Commit 69a7ca0 is the main one.

Here are the results for compressing all the .names files from https://github.com/jkbonfield/htscodecs-corpus/tree/master/names (as suggested by #57 (comment)) concatenated to each other (original size 33986678 bytes). Table entries contain compressed size in bytes / compression time in milliseconds; "new" is commit 3dca7de and "old" is commit 047aa84:

Level libdeflate (new) libdeflate (old) zlib
1 5212311 / 124 5384620 / 114 6007496 / 180
2 4965955 / 152 5283019 / 125 5856632 / 200
3 4934629 / 168 5229218 / 140 5598722 / 293
4 4850526 / 185 5142765 / 164 5223622 / 317
5 4608971 / 221 4902344 / 206 5011470 / 463
6 4505713 / 280 4804306 / 271 4927452 / 765
7 4453372 / 417 4758233 / 395 4749052 / 1049
8 4326029 / 760 4975789 / 1225 4658839 / 2557
9 4305630 / 1222 4643199 / 1759 4658374 / 3009
10 4044222 / 3471 4314207 / 2749 N/A
11 4023112 / 4762 4100868 / 4899 N/A
12 4019345 / 5893 4048534 / 6219 N/A

Here are the results for the seq file mentioned in #57 (comment). (Original size 109258164 bytes; it contains 301-character DNA sequences.)

Level libdeflate (new) libdeflate (old) zlib
1 8762885 / 284 8807034 / 281 11185034 / 464
2 7389469 / 317 7672567 / 291 10172931 / 526
3 6519681 / 380 7453610 / 314 9300280 / 811
4 6107118 / 502 7310010 / 382 7606177 / 816
5 6042739 / 511 6765495 / 407 6904720 / 1318
6 5916795 / 906 6600458 / 569 6658313 / 2961
7 5869025 / 2268 6494286 / 1137 6592924 / 5258
8 5769527 / 4326 6514850 / 4014 6448238 / 13671
9 5764232 / 5057 6217206 / 4907 6427283 / 14969
10 5727261 / 7158 6086587 / 6346 N/A
11 5669294 / 11322 5977565 / 8700 N/A
12 5601831 / 21063 5909305 / 12232 N/A

Results for TAIR10_chr1.fas (Arabidopsis thaliana chromosome 1; original size 30812908 bytes):

Level libdeflate (new) libdeflate (old) zlib
1 9622710 / 163 9622848 / 167 10434288 / 370
2 9445449 / 238 9505670 / 210 10055812 / 498
3 8621168 / 390 9370333 / 264 9650765 / 1016
4 8125898 / 689 9230411 / 376 9649402 / 733
5 8122981 / 697 9212240 / 431 9396040 / 1863
6 8080866 / 1476 9047177 / 679 9081950 / 5066
7 8120894 / 3832 8860922 / 1546 8931774 / 8950
8 8137945 / 6482 8642301 / 3975 8730647 / 25025
9 8135049 / 7257 8358494 / 5054 8723843 / 26177
10 7951675 / 4353 8343450 / 5255 N/A
11 7951495 / 4999 8304799 / 6008 N/A
12 7950784 / 5585 8296509 / 6865 N/A

Results for a FASTQ file containing Illumina sequencing reads (original size 146798341 bytes):

Level libdeflate (new) libdeflate (old) zlib
1 66241681 / 1008 68846711 / 1008 71346659 / 1993
2 65232261 / 1168 67777691 / 1142 70120245 / 2379
3 64525976 / 1458 66962260 / 1426 68239283 / 3870
4 64210758 / 1612 66104860 / 1879 66418997 / 3361
5 63297095 / 1847 65360530 / 2071 65578003 / 6554
6 62579781 / 2594 64673920 / 2854 64258311 / 14491
7 61960388 / 4458 64029845 / 4601 63481420 / 21404
8 61733834 / 7801 62099740 / 12120 63056558 / 37828
9 61683688 / 9269 60705152 / 15939 62951205 / 45619
10 59923123 / 16291 60470044 / 16975 N/A
11 59900738 / 19499 60032207 / 20433 N/A
12 59889501 / 23012 59956692 / 23556 N/A

Note: I improved the algorithms in other ways than the min_match_len heuristics, but that is the main one that matters here. In cases where the min_match_len improved the compression ratio a lot at levels 1-7, it is expected for there to be a performance regression, due to the increased number of literals used. Also, the regression in compression ratio at level 9 on the last file is expected, since the algorithm used at levels 8-9 changed to a faster one that avoids very bad results with certain types of files.

from libdeflate.

jkbonfield avatar jkbonfield commented on September 7, 2024

Many thanks. That looks like great work. :-)

from libdeflate.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.