This is a very hot function responsible for 30% of the time spent in inflate on an <co

THere is a faster rep copy algorithm with this commit <a class="commit-link" data-hove

Inflate: `copy_rep_matches` can be faster about zune-image HOT 25 CLOSED

etemesi254 commented on June 18, 2024

Inflate: `copy_rep_matches` can be faster

from zune-image.

Comments (25)

etemesi254 commented on June 18, 2024 2

I did some benchmarks with libdeflate, that thing is wickedly fast.

zune-inflate: 169 m
miniz-oxide: 174 m
libdeflate: 148 m

I think that warrants some more agressive safe optimizations. Lemme extract that zlib file from the exr file

from zune-image.

etemesi254 commented on June 18, 2024 1

THere is a faster rep copy algorithm with this commit 6bb95ac I think it should boost the speeeds currently.

Fuzz tested the three fuzzers this time,(sorry for the 0.2 release), so I don't think there are mistakes, but additional fuzzing wouldn't hurt.

CI should run fuzzing now.

from zune-image.

Shnatsel commented on June 18, 2024 1

I confirm it has improved the exr benchmark. Before:

test read_single_image_non_parallel_zips_rgba        ... bench: 199,622,716 ns/iter (+/- 338,823)

After:

test read_single_image_non_parallel_zips_rgba        ... bench: 187,659,328 ns/iter (+/- 76,902)

miniz_oxide for comparison:

test read_single_image_non_parallel_zips_rgba        ... bench: 196,783,469 ns/iter (+/- 172,721)

from zune-image.

etemesi254 commented on June 18, 2024 1

I confirm it has improved the exr benchmark. Before:

test read_single_image_non_parallel_zips_rgba        ... bench: 199,622,716 ns/iter (+/- 338,823)

After:

test read_single_image_non_parallel_zips_rgba        ... bench: 187,659,328 ns/iter (+/- 76,902)

miniz_oxide for comparison:

test read_single_image_non_parallel_zips_rgba        ... bench: 196,783,469 ns/iter (+/- 172,721)

This does also improve png decoding by 3% , almost hitting 2x faster speeds than imagers/png

from zune-image.

etemesi254 commented on June 18, 2024

Isn't the first one unsafe?

It does the equivalent of the below C code

while (dst<src){
   *dst++ =*src++;
}

And what exact optimizer hints? Would love to explore them.
Lemme experiment with some few ideas and see how it turns out

from zune-image.

Shnatsel commented on June 18, 2024

rle-decode-fast crate impements RLE in safe code. It includes some unsafe for a function that was since included in the standard library, vec.extend_from_within(), but the RLE code itself is safe.

And no, it doesn't do what you described - it copies regions that are exponentially increasing in size instead of copying byte-by-byte:

https://github.com/WanzenBug/rle-decode-helper/blob/bdafaedc43accc2c7c53408144f2bc58bfc1bb0a/src/lib.rs#L35-L65

Same result, different and faster implementation.

And what exact optimizer hints? Would love to explore them.

I meant the bounds check elision tricks I covered in the article. Adding assert!s on slice lengths allowed the optimizer to remove bounds checks in this function in the inflate crate, see image-rs/inflate#43

from zune-image.

etemesi254 commented on June 18, 2024

Assert hints are finicky, they are amazing when they work but they don't work mostly

e.g say assert(offset+12<dest.len()); won't help remove bounds check, because offset may be a big integer and adding 12 in release mode causes it to wrap around and become smaller than dest.len(), meaning that passes but during indexing it's an out of bounds.
The compiler knows this and hence it can't elide it.

Theoretically saturating add should work but it also doesn't, this is probably a missed optimization.

I've done a small change in the dev branch, should reduce the number of times we call copy_rep_matches, I've tested on my machine that fuzzing still works, and all tests pass.

Could you check if that helps

from zune-image.

Shnatsel commented on June 18, 2024

Sadly my machine is very noisy and is not really usable for benchmarking. I could get a cloud VM for the tests, but you'll need to tell me what I'm looking for exactly - criterion benchmark times or something else? The generated assembly, maybe?

from zune-image.

Shnatsel commented on June 18, 2024

Anyway I'm probably off for the night, but I'll try to take a look tomorrow or so.

from zune-image.

etemesi254 commented on June 18, 2024

just Criterion Benchmarks, or whatever you used for exr checks, hopefully the region should be smaller.

from zune-image.

Shnatsel commented on June 18, 2024

$ cargo +nightly fuzz run decode_buffer -- -timeout=1s

Almost immediately prints:

thread '<unnamed>' panicked at 'assertion failed: self.bits_left >= num_bits', /home/shnatsel/Code/zune-image/zune-inflate/src/bitstream.rs:106:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Also I think exr tests hang. PR: johannesvollmer/exrs#179 but the decode_buffer fuzzer should find it too

from zune-image.

etemesi254 commented on June 18, 2024

Fixed in 0.2.3

from zune-image.

etemesi254 commented on June 18, 2024

just Criterion Benchmarks, or whatever you used for exr checks, hopefully the region should be smaller.

The check that was added makes us call copy_rep_matches for matches less than FASTCOPY_BYTES and whose offsets would cause overlap on the current source.

Previously as long as the match was an overlap, we would call copy_rep_matches.

This now means for the hot loop, copy_rep_matches is called only for situations where loop is less than 16, remember we have room to do sloppy copies, but I can't get the compiler to elide bounds check even with a fixed/constant and asserts

from zune-image.

Shnatsel commented on June 18, 2024

sorry for the 0.2 release

No worries. Nobody's using it in prod yet, and it's good that we broke something now and learned from it rather than later!

from zune-image.

Shnatsel commented on June 18, 2024

I see that in 6bb95ac the copying loop is still very hot. I have an idea on how this could be improved, and how to remove bounds checks here.

We can write the output without bounds checks by iterating over the output slice with chunks_exact_mut as long as we have somewhere to copy from that doesn't overlap with the place to which we're writing.

imagine we have the fragment abc which we need to RLE-decode into abc.abc.abc.a (dots inserted for readability only, don't appear in the output).

The possible permutations we may have to write, assuming we copy 4 bytes at a time (FASTCOPY_BYTES = 4), are: abca, bcab, cabc, and that's it. So as long as we have written a slice that contains all of them, namely abc.abc, we can do split_at_mut on the stream, have the abc.abc on the left side, iterate over chunks_exact_mut on the right side, and copy the abca bcab cabc chunks from the left to the right.

This is guaranteed to remove the bounds checks from the hot loop.

We need to do one final copy of the remainder afterwards for correctness, as is common for chunks_exact_mut, to process any remaining data less than FASTCOPY_BYTES, but that's not hard.

from zune-image.

etemesi254 commented on June 18, 2024

Permute copy is interesting, the only decompressor I'm aware that does this is snappy, at https://github.com/google/snappy/blob/984b191f0fefdeb17050b42a90b7625999c13b8d/snappy.cc#L218-L339, but do remember the loop is copying less than FASTCOPY_BYTES(16 for now). this may mean we should not spend more time setting it up than we spend copying.

from zune-image.

Shnatsel commented on June 18, 2024

Ah, that's fair.

I wonder if permute copy with a really small size, like 2, will help at all? It will eliminate bounds checks on the actual copies, but may have other drawbacks.

from zune-image.

Shnatsel commented on June 18, 2024

Chances are good that the cost of the setup is going to dominate. It might be useful for non-overlapping copies though, since those are also hot and also have bounds checks right now.

from zune-image.

Shnatsel commented on June 18, 2024

Oh, the setup for permute copy is actually quite cheap: the sequence only needs to occur twice in a row (i.e. be copied once), after which you can run permute copy for the rest.

So it is an improvement as long as the sequence we're repeating is short; but if the sequence is as long or longer as the chunk we're writing, then we'll just spend the entire time in the setup and never reach permute copy.

from zune-image.

Shnatsel commented on June 18, 2024

Sorry if I'm stating the obvious, I'm discovering things and thinking them through as I go 😅

from zune-image.

etemesi254 commented on June 18, 2024

let dest_src_offset = src_offset + FASTCOPY_BYTES;
let (src, dest) = out_block.split_at_mut(current_position);
                            
current_position += FASTCOPY_BYTES;
  
let count = (length - 1) / FASTCOPY_BYTES;

 for (src_chunk, dest_chunk) in src[dest_src_offset..]
      .chunks_exact(FASTCOPY_BYTES)
      .zip(dest.chunks_exact_mut(FASTCOPY_BYTES))
      .take(count)
  {
      dest_chunk.copy_from_slice(src_chunk);
  }

Would have been brilliant for the long copy, but doesn't work when the match becomes overlapping

from zune-image.

Shnatsel commented on June 18, 2024

The long copy is hot on the profile too, so if this works for it - amazing!

I think you'll need to do a permute copy for correctness, so use slightly more complex iterator on src, something like src.windows().repeat()with maybe even astep_by()` in there somewhere, not sure.

But I'm going to take a break from thinking about optimizing RLE, it has kind of consumed my life lately 😅

from zune-image.

etemesi254 commented on June 18, 2024

Sorry if I'm stating the obvious, I'm discovering things and thinking them through as I go sweat_smile

Ideas are always welcome here :)

from zune-image.

etemesi254 commented on June 18, 2024

The long copy is hot on the profile too, so if this works for it - amazing!

I think you'll need to do a permute copy for correctness, so use slightly more complex iterator on src, something like src.windows().repeat()with maybe even astep_by()` in there somewhere, not sure.

But I'm going to take a break from thinking about optimizing RLE, it has kind of consumed my life lately sweat_smile

I took my break,time to put my skills to test :)

from zune-image.

Shnatsel commented on June 18, 2024

There have been RLE decoding speed improvements and this issue isn't actionable anymore, closing.

from zune-image.

Inflate: `copy_rep_matches` can be faster about zune-image HOT 25 CLOSED

Comments (25)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent