Comments (25)
I did some benchmarks with libdeflate, that thing is wickedly fast.
zune-inflate: 169 m
miniz-oxide: 174 m
libdeflate: 148 m
I think that warrants some more agressive safe optimizations. Lemme extract that zlib file from the exr file
from zune-image.
THere is a faster rep copy algorithm with this commit 6bb95ac I think it should boost the speeeds currently.
Fuzz tested the three fuzzers this time,(sorry for the 0.2 release), so I don't think there are mistakes, but additional fuzzing wouldn't hurt.
CI should run fuzzing now.
from zune-image.
I confirm it has improved the exr
benchmark. Before:
test read_single_image_non_parallel_zips_rgba ... bench: 199,622,716 ns/iter (+/- 338,823)
After:
test read_single_image_non_parallel_zips_rgba ... bench: 187,659,328 ns/iter (+/- 76,902)
miniz_oxide for comparison:
test read_single_image_non_parallel_zips_rgba ... bench: 196,783,469 ns/iter (+/- 172,721)
from zune-image.
I confirm it has improved the
exr
benchmark. Before:test read_single_image_non_parallel_zips_rgba ... bench: 199,622,716 ns/iter (+/- 338,823)
After:
test read_single_image_non_parallel_zips_rgba ... bench: 187,659,328 ns/iter (+/- 76,902)
miniz_oxide for comparison:
test read_single_image_non_parallel_zips_rgba ... bench: 196,783,469 ns/iter (+/- 172,721)
This does also improve png decoding by 3% , almost hitting 2x faster speeds than imagers/png
from zune-image.
Isn't the first one unsafe?
It does the equivalent of the below C code
while (dst<src){
*dst++ =*src++;
}
And what exact optimizer hints? Would love to explore them.
Lemme experiment with some few ideas and see how it turns out
from zune-image.
rle-decode-fast
crate impements RLE in safe code. It includes some unsafe for a function that was since included in the standard library, vec.extend_from_within()
, but the RLE code itself is safe.
And no, it doesn't do what you described - it copies regions that are exponentially increasing in size instead of copying byte-by-byte:
Same result, different and faster implementation.
And what exact optimizer hints? Would love to explore them.
I meant the bounds check elision tricks I covered in the article. Adding assert!
s on slice lengths allowed the optimizer to remove bounds checks in this function in the inflate
crate, see image-rs/inflate#43
from zune-image.
Assert hints are finicky, they are amazing when they work but they don't work mostly
e.g say assert(offset+12<dest.len())
; won't help remove bounds check, because offset may be a big integer and adding 12 in release mode causes it to wrap around and become smaller than dest.len()
, meaning that passes but during indexing it's an out of bounds.
The compiler knows this and hence it can't elide it.
Theoretically saturating add should work but it also doesn't, this is probably a missed optimization.
I've done a small change in the dev branch, should reduce the number of times we call copy_rep_matches, I've tested on my machine that fuzzing still works, and all tests pass.
Could you check if that helps
from zune-image.
Sadly my machine is very noisy and is not really usable for benchmarking. I could get a cloud VM for the tests, but you'll need to tell me what I'm looking for exactly - criterion benchmark times or something else? The generated assembly, maybe?
from zune-image.
Anyway I'm probably off for the night, but I'll try to take a look tomorrow or so.
from zune-image.
just Criterion Benchmarks, or whatever you used for exr checks, hopefully the region should be smaller.
from zune-image.
$ cargo +nightly fuzz run decode_buffer -- -timeout=1s
Almost immediately prints:
thread '<unnamed>' panicked at 'assertion failed: self.bits_left >= num_bits', /home/shnatsel/Code/zune-image/zune-inflate/src/bitstream.rs:106:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Also I think exr
tests hang. PR: johannesvollmer/exrs#179 but the decode_buffer
fuzzer should find it too
from zune-image.
Fixed in 0.2.3
from zune-image.
just Criterion Benchmarks, or whatever you used for exr checks, hopefully the region should be smaller.
The check that was added makes us call copy_rep_matches
for matches less than FASTCOPY_BYTES
and whose offsets would cause overlap on the current source.
Previously as long as the match was an overlap, we would call copy_rep_matches
.
This now means for the hot loop, copy_rep_matches is called only for situations where loop is less than 16, remember we have room to do sloppy copies, but I can't get the compiler to elide bounds check even with a fixed/constant and asserts
from zune-image.
sorry for the 0.2 release
No worries. Nobody's using it in prod yet, and it's good that we broke something now and learned from it rather than later!
from zune-image.
I see that in 6bb95ac the copying loop is still very hot. I have an idea on how this could be improved, and how to remove bounds checks here.
We can write the output without bounds checks by iterating over the output slice with chunks_exact_mut
as long as we have somewhere to copy from that doesn't overlap with the place to which we're writing.
imagine we have the fragment abc
which we need to RLE-decode into abc.abc.abc.a
(dots inserted for readability only, don't appear in the output).
The possible permutations we may have to write, assuming we copy 4 bytes at a time (FASTCOPY_BYTES
= 4), are: abca
, bcab
, cabc
, and that's it. So as long as we have written a slice that contains all of them, namely abc.abc
, we can do split_at_mut
on the stream, have the abc.abc
on the left side, iterate over chunks_exact_mut
on the right side, and copy the abca
bcab
cabc
chunks from the left to the right.
This is guaranteed to remove the bounds checks from the hot loop.
We need to do one final copy of the remainder afterwards for correctness, as is common for chunks_exact_mut
, to process any remaining data less than FASTCOPY_BYTES
, but that's not hard.
from zune-image.
Permute copy is interesting, the only decompressor I'm aware that does this is snappy, at https://github.com/google/snappy/blob/984b191f0fefdeb17050b42a90b7625999c13b8d/snappy.cc#L218-L339, but do remember the loop is copying less than FASTCOPY_BYTES(16 for now). this may mean we should not spend more time setting it up than we spend copying.
from zune-image.
Ah, that's fair.
I wonder if permute copy with a really small size, like 2, will help at all? It will eliminate bounds checks on the actual copies, but may have other drawbacks.
from zune-image.
Chances are good that the cost of the setup is going to dominate. It might be useful for non-overlapping copies though, since those are also hot and also have bounds checks right now.
from zune-image.
Oh, the setup for permute copy is actually quite cheap: the sequence only needs to occur twice in a row (i.e. be copied once), after which you can run permute copy for the rest.
So it is an improvement as long as the sequence we're repeating is short; but if the sequence is as long or longer as the chunk we're writing, then we'll just spend the entire time in the setup and never reach permute copy.
from zune-image.
Sorry if I'm stating the obvious, I'm discovering things and thinking them through as I go 😅
from zune-image.
let dest_src_offset = src_offset + FASTCOPY_BYTES;
let (src, dest) = out_block.split_at_mut(current_position);
current_position += FASTCOPY_BYTES;
let count = (length - 1) / FASTCOPY_BYTES;
for (src_chunk, dest_chunk) in src[dest_src_offset..]
.chunks_exact(FASTCOPY_BYTES)
.zip(dest.chunks_exact_mut(FASTCOPY_BYTES))
.take(count)
{
dest_chunk.copy_from_slice(src_chunk);
}
Would have been brilliant for the long copy, but doesn't work when the match becomes overlapping
from zune-image.
The long copy is hot on the profile too, so if this works for it - amazing!
I think you'll need to do a permute copy for correctness, so use slightly more complex iterator on src
, something like src.windows()
.repeat()with maybe even a
step_by()` in there somewhere, not sure.
But I'm going to take a break from thinking about optimizing RLE, it has kind of consumed my life lately
from zune-image.
Sorry if I'm stating the obvious, I'm discovering things and thinking them through as I go sweat_smile
Ideas are always welcome here :)
from zune-image.
The long copy is hot on the profile too, so if this works for it - amazing!
I think you'll need to do a permute copy for correctness, so use slightly more complex iterator on
src
, something likesrc.windows()
.repeat()with maybe even a
step_by()` in there somewhere, not sure.But I'm going to take a break from thinking about optimizing RLE, it has kind of consumed my life lately sweat_smile
I took my break,time to put my skills to test :)
from zune-image.
There have been RLE decoding speed improvements and this issue isn't actionable anymore, closing.
from zune-image.
Related Issues (20)
- JPEG: image decoded incorrectly HOT 1
- Channel::new_with_length is not generally sound HOT 13
- Channel without type_id leads to unsound use
- Channel::push has an overflow bug leading to unsoundness
- Zune Inflate: Fails to decode zlib-compressed binary data HOT 1
- JPEG XL: zune produces files that libjxl fails to decode HOT 1
- JPEG XL: zune encodes some images incorrectly HOT 5
- zune-jpeg: decode to BGR/BGRA image HOT 6
- Add zlib license as a license option HOT 3
- PNG: regression after 0.1.0: some images are garbled
- JPEG XL: zune encodes some images incorrectly
- `de_filter_sub3_sse2` calls the inner function without a check HOT 2
- Expand with alpha channel during decoding HOT 20
- DecoderOptions::get_bye_endian typo HOT 1
- new_with_options is inconsistent between PNG and JPEG HOT 2
- Where is Y-flipping? HOT 5
- zune-jpeg: noisy log spam HOT 4
- `zune-jpeg` 2x slower than `image` crate on my M1 MacBook HOT 4
- `zune-jpeg` Decoding to Luma colorspace gives bad results HOT 5
- MJPG decoding: Error decoding huffman values: No DC table for component Y HOT 5
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from zune-image.