Comments (9)
Just so I understand, this optimization is good for DNA files because having a smaller match length causes individual literals to be encoded using a more efficient Huffman encoding (because the resulting tree will only have LZ matches > ~8 in length in the lcode tree), and DNA reads are mostly random-looking data?
from libdeflate.
This issue concerns Illumina-sequencing read identifiers, not the actual DNA data in the read.
from libdeflate.
Sorry, I was confused by the formatting a little. So, in this particular case you see an improvement when setting the min match length because it will LZ the head of the ID (repeats heavily) then Huffman the tail, interesting.
from libdeflate.
Do you have any sample files available I could test with? I'd like to investigate whether this case could be handled better automatically using some heuristics. Adding more options increases API complexity so is generally something I'd like to avoid.
from libdeflate.
Many thanks for the quick reply. A heuristic if possible would be better still. I was just thinking about what would be simple (and offering to do it if it's something you'd want). I don't understand it well enough to work out where to put heuristics though.
Take a look at the data in https://github.com/jkbonfield/htscodecs-corpus/tree/master/names
A quick run through all the *.names files with and without a +4 hack as above:
Original
$ for i in tests/htscodecs-corpus/names/*.names;do ~/ftp/compression/libdeflate/gzip < $i 2>/dev/null|wc -c;done
72201
79313
101113
55364
100884
53751
75087
31300
59432
19211
177192
With change:
$ for i in tests/htscodecs-corpus/names/*.names;do ~/ftp/compression/libdeflate/gzip < $i 2>/dev/null|wc -c;done
66852
72401
92062
50348
93563
54053
68551
32222
54043
17307
161436
It generally helps a lot, but not universally.
On this data I've wondered whether a custom algorithm that creates deflate format using purely line-oriented prefix matching would give the best speed for size tradeoff. (However for slower modes of my work I have a dedicated (non-deflate) name compressor anyway, so the impetus to be clever isn't huge.)
from libdeflate.
Sorry, I was confused by the formatting a little. So, in this particular case you see an improvement when setting the min match length because it will LZ the head of the ID (repeats heavily) then Huffman the tail, interesting.
I haven't analysed quite what it's doing, but my thinking was the form of these strings is basically common prefix (instrument name) then :lane:tile:x:y
. There is some dictionary component to tile, but the x and y are largely random. There may be lots of X and Y strings and substrings just because with only 10 characters the digits have a good chance of random matching (1 in 100 if we include colon + 2 digits).
It's quite easy for parsers to generate an LZ match for "Y newline prefix:lane" instead of just "prefix:lane", possibly at the expense of making a longer distance when it's not really justified for random chance match. (I assume that's what the optimal parser is trying to judge though.) Changing the min match length doesn't stop that so it's not what my change achieved clearly. I did try hacking it to require a newline starting character but that turned out to be tragic, so canned that idea quickly.
from libdeflate.
Another test, this time with DNA sequence data, also shows Zlib's Z_DEFAULT_STRATEGY vs Z_FILTERED to be a win. I'll put the data file in ftp://ftp.sanger.ac.uk/pub/users/jkb/seq
Z_DEFAULT Z_FILTERED Default+Opt Default-Opt MIN_10 -Opt 7za
1 11185034 11185034 8807052 8807052 6767390
2 10172931 10172931 7672585 7672585 6329695
3 9300280 9300280 7453628 7453628 6208820
4 7606177 6803282 7310028 7310028 6102399
5 6904720 6537932 6765513 6765513 6027264
6 6658313 6447436 6600476 6600476 5897747
7 6592924 6440727 6494304 6494304 5831461
8 6448238 6325678 6514868 6434674 5790594
9 6427283 6303307 6217224 6385173 5751458 5722540
10 6086605
11 5977583
12 5909323
That shows Zlib with Z_DEFAULT_STRATEGY vs Z_FILTERED, Libdeflate as cloned (optimal parsing enabled for level 8-12), libdeflate with SUPPORT_NEAR_OPTIMAL_PARSING disabled (levels 8,9 only), libdeflate with SUPPORT_NEAR_OPTIMAL_PARSING disabled and +7 on the existing min_match len of 3, and finally max compression of 7zip in gzip mode.
You can see by default libdeflate beats zlib's defaults. Zlib gains a bit using Z_FILTERED, but it's still beaten when we use higher levels of libdeflate. However libdefault with a longer minimum match length trounces everything bar 7za and it gets pretty close to matching that too.
For what it's worth, I see the same benefits in zlib. Z_FILTERED is changing min match from 3 to 5. Bumping it to 10 manually shows level 6 gives 5935354 and level 9 5773148, so again we see potential for manual tweakage there too.
PS. I don't know how to limit the minimum match length for the optimal parser, so no numbers have been presented there.
from libdeflate.
I implemented some heuristics which handle setting the minimum match length automatically. Commit 69a7ca0 is the main one.
Here are the results for compressing all the .names
files from https://github.com/jkbonfield/htscodecs-corpus/tree/master/names (as suggested by #57 (comment)) concatenated to each other (original size 33986678 bytes). Table entries contain compressed size in bytes / compression time in milliseconds
; "new" is commit 3dca7de and "old" is commit 047aa84:
Level | libdeflate (new) | libdeflate (old) | zlib |
---|---|---|---|
1 | 5212311 / 124 | 5384620 / 114 | 6007496 / 180 |
2 | 4965955 / 152 | 5283019 / 125 | 5856632 / 200 |
3 | 4934629 / 168 | 5229218 / 140 | 5598722 / 293 |
4 | 4850526 / 185 | 5142765 / 164 | 5223622 / 317 |
5 | 4608971 / 221 | 4902344 / 206 | 5011470 / 463 |
6 | 4505713 / 280 | 4804306 / 271 | 4927452 / 765 |
7 | 4453372 / 417 | 4758233 / 395 | 4749052 / 1049 |
8 | 4326029 / 760 | 4975789 / 1225 | 4658839 / 2557 |
9 | 4305630 / 1222 | 4643199 / 1759 | 4658374 / 3009 |
10 | 4044222 / 3471 | 4314207 / 2749 | N/A |
11 | 4023112 / 4762 | 4100868 / 4899 | N/A |
12 | 4019345 / 5893 | 4048534 / 6219 | N/A |
Here are the results for the seq
file mentioned in #57 (comment). (Original size 109258164 bytes; it contains 301-character DNA sequences.)
Level | libdeflate (new) | libdeflate (old) | zlib |
---|---|---|---|
1 | 8762885 / 284 | 8807034 / 281 | 11185034 / 464 |
2 | 7389469 / 317 | 7672567 / 291 | 10172931 / 526 |
3 | 6519681 / 380 | 7453610 / 314 | 9300280 / 811 |
4 | 6107118 / 502 | 7310010 / 382 | 7606177 / 816 |
5 | 6042739 / 511 | 6765495 / 407 | 6904720 / 1318 |
6 | 5916795 / 906 | 6600458 / 569 | 6658313 / 2961 |
7 | 5869025 / 2268 | 6494286 / 1137 | 6592924 / 5258 |
8 | 5769527 / 4326 | 6514850 / 4014 | 6448238 / 13671 |
9 | 5764232 / 5057 | 6217206 / 4907 | 6427283 / 14969 |
10 | 5727261 / 7158 | 6086587 / 6346 | N/A |
11 | 5669294 / 11322 | 5977565 / 8700 | N/A |
12 | 5601831 / 21063 | 5909305 / 12232 | N/A |
Results for TAIR10_chr1.fas
(Arabidopsis thaliana chromosome 1; original size 30812908 bytes):
Level | libdeflate (new) | libdeflate (old) | zlib |
---|---|---|---|
1 | 9622710 / 163 | 9622848 / 167 | 10434288 / 370 |
2 | 9445449 / 238 | 9505670 / 210 | 10055812 / 498 |
3 | 8621168 / 390 | 9370333 / 264 | 9650765 / 1016 |
4 | 8125898 / 689 | 9230411 / 376 | 9649402 / 733 |
5 | 8122981 / 697 | 9212240 / 431 | 9396040 / 1863 |
6 | 8080866 / 1476 | 9047177 / 679 | 9081950 / 5066 |
7 | 8120894 / 3832 | 8860922 / 1546 | 8931774 / 8950 |
8 | 8137945 / 6482 | 8642301 / 3975 | 8730647 / 25025 |
9 | 8135049 / 7257 | 8358494 / 5054 | 8723843 / 26177 |
10 | 7951675 / 4353 | 8343450 / 5255 | N/A |
11 | 7951495 / 4999 | 8304799 / 6008 | N/A |
12 | 7950784 / 5585 | 8296509 / 6865 | N/A |
Results for a FASTQ file containing Illumina sequencing reads (original size 146798341 bytes):
Level | libdeflate (new) | libdeflate (old) | zlib |
---|---|---|---|
1 | 66241681 / 1008 | 68846711 / 1008 | 71346659 / 1993 |
2 | 65232261 / 1168 | 67777691 / 1142 | 70120245 / 2379 |
3 | 64525976 / 1458 | 66962260 / 1426 | 68239283 / 3870 |
4 | 64210758 / 1612 | 66104860 / 1879 | 66418997 / 3361 |
5 | 63297095 / 1847 | 65360530 / 2071 | 65578003 / 6554 |
6 | 62579781 / 2594 | 64673920 / 2854 | 64258311 / 14491 |
7 | 61960388 / 4458 | 64029845 / 4601 | 63481420 / 21404 |
8 | 61733834 / 7801 | 62099740 / 12120 | 63056558 / 37828 |
9 | 61683688 / 9269 | 60705152 / 15939 | 62951205 / 45619 |
10 | 59923123 / 16291 | 60470044 / 16975 | N/A |
11 | 59900738 / 19499 | 60032207 / 20433 | N/A |
12 | 59889501 / 23012 | 59956692 / 23556 | N/A |
Note: I improved the algorithms in other ways than the min_match_len heuristics, but that is the main one that matters here. In cases where the min_match_len improved the compression ratio a lot at levels 1-7, it is expected for there to be a performance regression, due to the increased number of literals used. Also, the regression in compression ratio at level 9 on the last file is expected, since the algorithm used at levels 8-9 changed to a faster one that avoids very bad results with certain types of files.
from libdeflate.
Many thanks. That looks like great work. :-)
from libdeflate.
Related Issues (20)
- ESA matchfinder HOT 4
- crc32 is twice as fast in zlib-ng HOT 19
- how to change sliding window size at runtime? HOT 1
- arm_acle.h compiler errors when building for iOS with Xcode 15.3 HOT 4
- analyzer warning regression: The left operand of '+' is a garbage value due to array index out of bounds HOT 1
- Unaligned access on PowerPC: forgotten macros in `common_defs.h` HOT 11
- 32-bit build error when using cctools 949.0.1 with assembler that doesn't support AVX instructions HOT 7
- VNNI support seems to require something later than GCC 11.1 HOT 3
- error: inlining failed in call to 'always_inline' in some builds HOT 1
- BUILD: target specific option mismatch HOT 3
- 1.20 does not compile HOT 2
- compression & decompression usage from browser to libdeflate
- Static builds and exported symbols HOT 2
- Compilation on aarch64-linux requires GCC 7.1
- libdeflate_crc32 performance when buffer is 15 bytes or fewer HOT 8
- benchmark against another zlib fork HOT 1
- ARM feature issues on GCC 13.3.1? HOT 7
- Compilation Error Despite Disabling AVX Instructions HOT 3
- build with GCC13 on NetBSD failed due to Error: unsupported instruction `vpdpbusd' HOT 6
- Building error in ARM v8 system HOT 4
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from libdeflate.