One of the zlib features is the strategy option, which permits Z_DEFAULT_STRATEGY vs Z

User control over minimum match length about libdeflate HOT 9 CLOSED

jkbonfield commented on September 7, 2024

User control over minimum match length

from libdeflate.

Comments (9)

adamkewley commented on September 7, 2024

Just so I understand, this optimization is good for DNA files because having a smaller match length causes individual literals to be encoded using a more efficient Huffman encoding (because the resulting tree will only have LZ matches > ~8 in length in the lcode tree), and DNA reads are mostly random-looking data?

from libdeflate.

chrchang commented on September 7, 2024

This issue concerns Illumina-sequencing read identifiers, not the actual DNA data in the read.

from libdeflate.

adamkewley commented on September 7, 2024

Sorry, I was confused by the formatting a little. So, in this particular case you see an improvement when setting the min match length because it will LZ the head of the ID (repeats heavily) then Huffman the tail, interesting.

from libdeflate.

ebiggers commented on September 7, 2024

Do you have any sample files available I could test with? I'd like to investigate whether this case could be handled better automatically using some heuristics. Adding more options increases API complexity so is generally something I'd like to avoid.

from libdeflate.

jkbonfield commented on September 7, 2024

Many thanks for the quick reply. A heuristic if possible would be better still. I was just thinking about what would be simple (and offering to do it if it's something you'd want). I don't understand it well enough to work out where to put heuristics though.

Take a look at the data in https://github.com/jkbonfield/htscodecs-corpus/tree/master/names

A quick run through all the *.names files with and without a +4 hack as above:

Original

$ for i in tests/htscodecs-corpus/names/*.names;do ~/ftp/compression/libdeflate/gzip < $i 2>/dev/null|wc -c;done
72201
79313
101113
55364
100884
53751
75087
31300
59432
19211
177192

With change:

$ for i in tests/htscodecs-corpus/names/*.names;do ~/ftp/compression/libdeflate/gzip < $i 2>/dev/null|wc -c;done
66852
72401
92062
50348
93563
54053
68551
32222
54043
17307
161436

It generally helps a lot, but not universally.
On this data I've wondered whether a custom algorithm that creates deflate format using purely line-oriented prefix matching would give the best speed for size tradeoff. (However for slower modes of my work I have a dedicated (non-deflate) name compressor anyway, so the impetus to be clever isn't huge.)

from libdeflate.

jkbonfield commented on September 7, 2024

Sorry, I was confused by the formatting a little. So, in this particular case you see an improvement when setting the min match length because it will LZ the head of the ID (repeats heavily) then Huffman the tail, interesting.

I haven't analysed quite what it's doing, but my thinking was the form of these strings is basically common prefix (instrument name) then :lane:tile:x:y. There is some dictionary component to tile, but the x and y are largely random. There may be lots of X and Y strings and substrings just because with only 10 characters the digits have a good chance of random matching (1 in 100 if we include colon + 2 digits).

It's quite easy for parsers to generate an LZ match for "Y newline prefix:lane" instead of just "prefix:lane", possibly at the expense of making a longer distance when it's not really justified for random chance match. (I assume that's what the optimal parser is trying to judge though.) Changing the min match length doesn't stop that so it's not what my change achieved clearly. I did try hacking it to require a newline starting character but that turned out to be tragic, so canned that idea quickly.

from libdeflate.

jkbonfield commented on September 7, 2024

Another test, this time with DNA sequence data, also shows Zlib's Z_DEFAULT_STRATEGY vs Z_FILTERED to be a win. I'll put the data file in ftp://ftp.sanger.ac.uk/pub/users/jkb/seq

      Z_DEFAULT  Z_FILTERED   Default+Opt    Default-Opt  MIN_10 -Opt   7za
1     11185034   11185034     8807052        8807052      6767390
2     10172931   10172931     7672585        7672585      6329695
3      9300280    9300280     7453628        7453628      6208820
4      7606177    6803282     7310028        7310028      6102399
5      6904720    6537932     6765513        6765513      6027264
6      6658313    6447436     6600476        6600476      5897747
7      6592924    6440727     6494304        6494304      5831461
8      6448238    6325678     6514868        6434674      5790594
9      6427283    6303307     6217224        6385173      5751458       5722540
10     		  	      6086605
11     			      5977583
12     			      5909323

That shows Zlib with Z_DEFAULT_STRATEGY vs Z_FILTERED, Libdeflate as cloned (optimal parsing enabled for level 8-12), libdeflate with SUPPORT_NEAR_OPTIMAL_PARSING disabled (levels 8,9 only), libdeflate with SUPPORT_NEAR_OPTIMAL_PARSING disabled and +7 on the existing min_match len of 3, and finally max compression of 7zip in gzip mode.

You can see by default libdeflate beats zlib's defaults. Zlib gains a bit using Z_FILTERED, but it's still beaten when we use higher levels of libdeflate. However libdefault with a longer minimum match length trounces everything bar 7za and it gets pretty close to matching that too.

For what it's worth, I see the same benefits in zlib. Z_FILTERED is changing min match from 3 to 5. Bumping it to 10 manually shows level 6 gives 5935354 and level 9 5773148, so again we see potential for manual tweakage there too.

PS. I don't know how to limit the minimum match length for the optimal parser, so no numbers have been presented there.

from libdeflate.

ebiggers commented on September 7, 2024

I implemented some heuristics which handle setting the minimum match length automatically. Commit 69a7ca0 is the main one.

Here are the results for compressing all the .names files from https://github.com/jkbonfield/htscodecs-corpus/tree/master/names (as suggested by #57 (comment)) concatenated to each other (original size 33986678 bytes). Table entries contain compressed size in bytes / compression time in milliseconds; "new" is commit 3dca7de and "old" is commit 047aa84:

Level	libdeflate (new)	libdeflate (old)	zlib
1	5212311 / 124	5384620 / 114	6007496 / 180
2	4965955 / 152	5283019 / 125	5856632 / 200
3	4934629 / 168	5229218 / 140	5598722 / 293
4	4850526 / 185	5142765 / 164	5223622 / 317
5	4608971 / 221	4902344 / 206	5011470 / 463
6	4505713 / 280	4804306 / 271	4927452 / 765
7	4453372 / 417	4758233 / 395	4749052 / 1049
8	4326029 / 760	4975789 / 1225	4658839 / 2557
9	4305630 / 1222	4643199 / 1759	4658374 / 3009
10	4044222 / 3471	4314207 / 2749	N/A
11	4023112 / 4762	4100868 / 4899	N/A
12	4019345 / 5893	4048534 / 6219	N/A

Here are the results for the seq file mentioned in #57 (comment). (Original size 109258164 bytes; it contains 301-character DNA sequences.)

Level	libdeflate (new)	libdeflate (old)	zlib
1	8762885 / 284	8807034 / 281	11185034 / 464
2	7389469 / 317	7672567 / 291	10172931 / 526
3	6519681 / 380	7453610 / 314	9300280 / 811
4	6107118 / 502	7310010 / 382	7606177 / 816
5	6042739 / 511	6765495 / 407	6904720 / 1318
6	5916795 / 906	6600458 / 569	6658313 / 2961
7	5869025 / 2268	6494286 / 1137	6592924 / 5258
8	5769527 / 4326	6514850 / 4014	6448238 / 13671
9	5764232 / 5057	6217206 / 4907	6427283 / 14969
10	5727261 / 7158	6086587 / 6346	N/A
11	5669294 / 11322	5977565 / 8700	N/A
12	5601831 / 21063	5909305 / 12232	N/A

Results for TAIR10_chr1.fas (Arabidopsis thaliana chromosome 1; original size 30812908 bytes):

Level	libdeflate (new)	libdeflate (old)	zlib
1	9622710 / 163	9622848 / 167	10434288 / 370
2	9445449 / 238	9505670 / 210	10055812 / 498
3	8621168 / 390	9370333 / 264	9650765 / 1016
4	8125898 / 689	9230411 / 376	9649402 / 733
5	8122981 / 697	9212240 / 431	9396040 / 1863
6	8080866 / 1476	9047177 / 679	9081950 / 5066
7	8120894 / 3832	8860922 / 1546	8931774 / 8950
8	8137945 / 6482	8642301 / 3975	8730647 / 25025
9	8135049 / 7257	8358494 / 5054	8723843 / 26177
10	7951675 / 4353	8343450 / 5255	N/A
11	7951495 / 4999	8304799 / 6008	N/A
12	7950784 / 5585	8296509 / 6865	N/A

Results for a FASTQ file containing Illumina sequencing reads (original size 146798341 bytes):

Level	libdeflate (new)	libdeflate (old)	zlib
1	66241681 / 1008	68846711 / 1008	71346659 / 1993
2	65232261 / 1168	67777691 / 1142	70120245 / 2379
3	64525976 / 1458	66962260 / 1426	68239283 / 3870
4	64210758 / 1612	66104860 / 1879	66418997 / 3361
5	63297095 / 1847	65360530 / 2071	65578003 / 6554
6	62579781 / 2594	64673920 / 2854	64258311 / 14491
7	61960388 / 4458	64029845 / 4601	63481420 / 21404
8	61733834 / 7801	62099740 / 12120	63056558 / 37828
9	61683688 / 9269	60705152 / 15939	62951205 / 45619
10	59923123 / 16291	60470044 / 16975	N/A
11	59900738 / 19499	60032207 / 20433	N/A
12	59889501 / 23012	59956692 / 23556	N/A

Note: I improved the algorithms in other ways than the min_match_len heuristics, but that is the main one that matters here. In cases where the min_match_len improved the compression ratio a lot at levels 1-7, it is expected for there to be a performance regression, due to the increased number of literals used. Also, the regression in compression ratio at level 9 on the last file is expected, since the algorithm used at levels 8-9 changed to a faster one that avoids very bad results with certain types of files.

from libdeflate.

jkbonfield commented on September 7, 2024

Many thanks. That looks like great work. :-)

from libdeflate.

User control over minimum match length about libdeflate HOT 9 CLOSED

Comments (9)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent