Comments (17)

disheng222 commented on June 9, 2024

Hi Peter,
For difficult-to-compress cases (or fairly low error bounds), I suggest you increase the number of quantization bins. To this end, you can modify sz.config as follows:
max_quant_intervals = 67108864
(The default value was 65536)
Then, use the following command to compress the dataset again:
sz -c sz.config -z -d -i chaotic_data.dat -M ABS -A tolerance -3 256 256 256
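If you want to script this change rather than edit the file by hand, one possible way (a sketch; it assumes the max_quant_intervals line already exists uncommented in sz.config, uses GNU sed, and takes 1E-5 as an example tolerance):

sed -i 's/^max_quant_intervals.*/max_quant_intervals = 67108864/' sz.config
sz -c sz.config -z -d -i chaotic_data.dat -M ABS -A 1E-5 -3 256 256 256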

I tested that double-type dataset (its values are very random as I observed, so the compression ratio is low).

ABS     bit-rate   PSNR
2^-14   13.9        83.043196
2^-15   15.3        89.058258
2^-16   16.5        95.080285
2^-17   17.9       101.100309
2^-18   22.4       113.142963
2^-19   26.3       119.162523
2^-20   32.6       125.183684

I think this result makes more sense.
Increasing max_quant_intervals can mitigate that issue. As you know, max_quant_intervals is stored as an integer type, so there is an upper limit for this setting. Then, once the error bound is small enough, the bit rate will jump again.
Hope this answer is helpful to your research.

Best,
Sheng

lindstro commented on June 9, 2024

Sheng, thanks for the suggestion. As someone who has not played around with SZ parameter settings much, is there a way to algorithmically choose the number of quantization bins, or is this basically a trial and error process that one has to go through for each data set?

As evidenced by the table you listed, the rate increase per halving of the tolerance starts climbing again for the last few tolerances (ideally, the rate would not increase by more than one bit per halving), which suggests to me that one would then have to increase the number of quantization bins even further. Is there a limit on the number of bins? Should it be a power of two?

I imagine that increasing the number of bins may adversely affect the compression ratio for large tolerances, i.e., there may be no single setting that works well for a wide range of tolerances. Otherwise, would there not be a better default setting?

Finally, is there a way to set this parameter without using an sz.config file? It is a bit cumbersome to have to generate such a text file when using the SZ command-line utility in shell scripts.

robertu94 commented on June 9, 2024

I'll leave the rest to Sheng.

Finally, is there a way to set this parameter without using an sz.config file? It is a bit cumbersome to have to generate such a text file when using the SZ command-line utility in shell scripts.

@lindstro Not with the sz command line, but the libpressio command line from my libpressio-tools Spack package provides a way to do that for SZ:

git clone https://github.com/robertu94/spack_packages robertu94_packages
spack repo add ./robertu94_packages
spack install libpressio-tools ^ libpressio+sz+zfp
# if your compiler is older (pre-C++17) you might need this instead
spack install libpressio-tools ^ libpressio+sz+zfp ^ libstdcompat+boost
spack load libpressio-tools

# doesn't matter here, but dims are in fortran order for all compressors; last 3 args print metrics
pressio -i ./chaotic_data.dat -t double -d 256 -d 256 -d 256 -b compressor=sz -o sz:max_quant_intervals=67108864 -o pressio:abs=1e-14 -m time -m size -M all

#print help for a compressor
pressio -b compressor=sz -a help
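To sweep several error bounds from a shell script, the same invocation can be looped; a sketch reusing the flags above, with 2^-14, 2^-16, and 2^-20 written out as decimal tolerances:

for tol in 6.103515625e-05 1.52587890625e-05 9.5367431640625e-07; do
  pressio -i ./chaotic_data.dat -t double -d 256 -d 256 -d 256 \
    -b compressor=sz -o sz:max_quant_intervals=67108864 \
    -o pressio:abs=$tol -m time -m size -M all
done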

lindstro commented on June 9, 2024

@robertu94 Thanks for pointing that out. That is very convenient and another reason why I need to take a closer look at libpressio.

disheng222 commented on June 9, 2024

Hi Peter,
As for the sharp increase of the bit rate at an accuracy of 2^-20, that's because of the Huffman tree overhead.
I checked the content of the compressed data. Basically, when the accuracy is 2^-12 to 2^-16, the major part is the Huffman-encoded bytes and the Huffman tree overhead is tiny. However, when the accuracy is 2^-20, the Huffman tree itself is about 1.5X as large as the Huffman-encoded bytes. In fact, in SZ, we didn't try our best to store the Huffman tree as densely as possible, because SZ is mainly focused on lossy compression use cases, and we didn't target cases with extremely low error bounds such as 2^-20. That is, if the error bound is fairly low, or if the original data file is very small (e.g., around 1 MB), the Huffman tree overhead is not negligible, and we should study how to minimize such overhead.
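For a sense of scale, the -q output shown later in this thread (at ABS = 1E-5) reports huffmanTreeSize 7792474 against huffmanCodingSize 36967490, i.e., the tree is already roughly 20% of the encoded payload at that moderate bound; as the bound shrinks, the number of distinct quantization codes (and hence the tree) grows much faster than the encoded payload, which is how the tree ends up about 1.5X the payload at 2^-20.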

To check the detailed components of the compressed data in SZ, you can do the following:
./configure --enable-writestats
make

Then, when you run the 'sz' command, you can add the option -q to print the stats, which include more information such as the Huffman tree size and the number of encoded bytes.
BTW, you can use -p to print the actual number of quantization bins used in the compression: e.g., sz -p -s chaotic_data.dat.sz
if "actual used # intervals" doesn't reach 67108864, that means the quantization bin count is large enough to cover all the predicted values. BTW, the 'predThreshold' in sz.config is better to be set to 0.999 or larger value in order to make sure more data points can be covered by the quantization bins. All the above parameters are supported by Libpressio.

Best,
Sheng

robertu94 commented on June 9, 2024

@lindstro to elaborate on this

BTW, you can use -p to print the actual number of quantization bins used in the compression: e.g., sz -p -s chaotic_data.dat.sz

LibPressio's CLI is capable of outputting this information as well. It will automatically detect that you compiled sz+stats and add these metrics to the output above. The entire install command would then be:

spack install libpressio-tools ^ libpressio+sz+zfp ^ sz+stats

lindstro commented on June 9, 2024

In fact, in SZ, we didn't try our best to store the Huffman tree as densely as possible, because SZ is mainly focused on lossy compression use cases, and we didn't target cases with extremely low error bounds such as 2^-20.

Thanks for the explanation. I would argue that a tolerance of 2^-20 ≈ 10^-6 for data in (0, 1) is not all that low; it provides less accuracy than single precision. But it is good to know that this limitation is not unique to this data set, even if such difficult-to-compress data may require a larger number of quantization bins.

Just to make sure I fully understand, using 2^n bins allows SZ to encode residuals in roughly n bits (depending on entropy, of course). But for very large residuals (or small n) outside the range of available bins, SZ may have to record the residual as a full floating-point value, which generally is more expensive. And for random data, SZ is likely to give poor predictions that result in frequent, large residuals. Do I have that right?

To check the detailed components of the compressed data in SZ, you can do the following: ./configure --enable-writestats; make

That seems handy. Is there an analogous setting for CMake builds? cmake -L does not reveal anything obvious.

BTW, you can use -p to print the actual number of quantization bins used in the compression: e.g., sz -p -s chaotic_data.dat.sz. If "actual used # intervals" doesn't reach 67108864, that means the quantization bin count is large enough to cover all the predicted values.

I see. In this case, should one rerun with the number of bins reported?

robertu94 commented on June 9, 2024

Leaving the rest to @disheng222

That seems handy. Is there an analogous setting for CMake builds? cmake -L does not reveal anything obvious.

@lindstro the corresponding setting is BUILD_STATS:BOOL=ON

That is what libpressio uses when it builds with sz+stats.
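For example, something like the following should work (a sketch; /path/to/SZ stands in for the SZ source directory, and an out-of-source build is assumed):

cmake -DBUILD_STATS:BOOL=ON /path/to/SZ
make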

lindstro commented on June 9, 2024

@lindstro the corresponding setting is BUILD_STATS:BOOL=ON

Doh! Not sure how I missed that. Thanks.

disheng222 commented on June 9, 2024

@lindstro
"I would argue that a tolerance of 2-20 ≈ 10-6 for data in (0, 1) is not all that low"
Yes. But the chaotic dataset you are using here is essentially a random data, so the prediction in SZ works very inefficiently. This is why we need to use a larger number of quantization bins to cover it. This situation is not expected by SZ. :-)
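If anyone wants a quick stand-in for such data, the following sketch writes a similarly incompressible 256x256x256 array of random doubles in (0, 1); random_like.dat is a made-up name, and this is not the original chaotic dataset, just something with the same shape, type, and randomness:

python3 -c "import numpy as np; np.random.rand(256, 256, 256).tofile('random_like.dat')"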

"Just to make sure I fully understand, using 2^n bins allows SZ to encode residuals in roughly n bits (depending on entropy, of course). ......."
This understanding is basically correct. Please kindly note that SZ significantly depends on the smoothness of the data (because it uses a prediction method and encodes the residuals). If by 'entropy' here you mean the entropy of the residuals, then yes, the compression results depend on it (rather than the entropy of the original data).

You can use -q or -p or both in your compression operation as follows:

[sdi@localhost example]$ sz -p -q -z -d -c sz.config -i chaotic_data.dat -M ABS -A 1E-5 -3 256 256 256
===============stats about sz================
Constant data? : NO
use_mean: YES
blockSize 6
lorenzoPercent 0.955958
regressionPercent 0.044042
lorenzoBlocks 70825
regressionBlocks 3263
totalBlocks 74088
huffmanTreeSize 7792474
huffmanCodingSize 36967490
huffmanCompressionRatio 1.499306
huffmanNodeCount 599421
unpredictCount 0
unpredictPercent 0.000000
compression time = 1.063208
compressed data file: chaotic_data.dat.sz
=================SZ Compression Meta Data=================
Version: 2.1.12
Constant data?: NO
Lossless?: NO
Size type (size of # elements): 8 bytes
Num of elements: 16777216
compressor Name: SZ
Data type: DOUBLE
min value of raw data: 0.000000
max value of raw data: 1.000000
quantization_intervals: 0
max_quant_intervals: 67108864
actual used # intervals: 524288
dataEndianType (prior raw data): LITTLE_ENDIAN
sysEndianType (at compression): LITTLE_ENDIAN
sampleDistance: 100
predThreshold: 0.999000
szMode: SZ_BEST_COMPRESSION (with Zstd or Gzip)
gzipMode: Z_BEST_SPEED
errBoundMode: ABS
absErrBound: 0.000010
[sdi@localhost example]$

The -p metadata is stored in the compressed data, so you can also use -p to inspect a compressed file as follows:

[sdi@localhost example]$ sz -p -s chaotic_data.dat.sz
=================SZ Compression Meta Data=================
Version: 2.1.12
Constant data?: NO
Lossless?: NO
Size type (size of # elements): 8 bytes
Num of elements: 16777216
compressor Name: SZ
Data type: DOUBLE
min value of raw data: 0.000000
max value of raw data: 1.000000
quantization_intervals: 0
max_quant_intervals: 67108864
actual used # intervals: 524288
dataEndianType (prior raw data): LITTLE_ENDIAN
sysEndianType (at compression): LITTLE_ENDIAN
sampleDistance: 100
predThreshold: 0.999000
szMode: SZ_BEST_COMPRESSION (with Zstd or Gzip)
gzipMode: Z_BEST_SPEED
errBoundMode: ABS
absErrBound: 0.000010

lindstro commented on June 9, 2024

@disheng222 I tried your proposed fix of using 2^26 quantization bins. This does improve things a little for this type of random data at high rates (low tolerances). There's a huge jump in rate, from about 32 to 60, when halving the error tolerance. That's a bit surprising.

[figure: sz]

I'm also including the results of this change in the number of bins in the case of compressible data, namely the Miranda viscosity field from SDRBench. It doesn't seem that using more bins helps at high rates here. In fact, SZ does worse when increasing the number of bins. I'm just curious if there are other parameters that might help.

[figures: sz, viscosity]

disheng222 commented on June 9, 2024

I tested SZ2 (github.com/szcompressor/SZ) and SZ3 (github.com/szcompressor/SZ3) using Miranda's viscocity.d64.
I got the following results for the accuracy gain, which I computed in the following way:
α = log₂(σ / E) - R, where σ is the standard deviation of the original dataset, E is the root mean squared error, and R is the bit rate (w.r.t. 64 bits). Hope this calculation is correct.
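As a hypothetical worked example of that definition: with σ = 0.29, E = 10^-6, and R = 16 bits/value, α = log₂(0.29 / 10^-6) - 16 ≈ 18.1 - 16 ≈ 2.1.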
Then, I got the following results:
[screenshot: Screenshot from 2022-07-23 00-03-16]
I am using fixed quantization bins (65536) to run SZ2 and SZ3.
It seems that the accuracy gain I got is slightly different from your result.
Another observation is that SZ3 is better than SZ2.

The drop in SZ's accuracy gain is likely because of the drop in its compression ratio when the error bound is very small. One can increase the maximum number of quantization bins via the configuration file, but this also increases the Huffman tree size, which may offset the compression ratio improvement. How to store the Huffman tree more effectively was not a focus of our previous design, because we didn't consider such small error bounds before; I just used a naive implementation to store the Huffman tree. Improving the Huffman tree storage could increase the accuracy gain to a certain extent, but that has not been done yet.
In fact, we recently developed a version (called QoZ) that is better than SZ3: it gets better quality for high-error-bound cases but not for low-error-bound cases. It hasn't been integrated into the SZ3 GitHub repo yet.

disheng222 commented on June 9, 2024

I tested SZ3 using a maximum of 1048576 quantization bins, which can be set in sz3.config.
Then, I got the following results:

[screenshot: Screenshot from 2022-07-23 10-04-17]

The result (i.e., SZ3_1m) looks better than both SZ2 and SZ3 (default).
If I use a larger number of quantization bins, the result gets worse, probably because of the Huffman tree storage overhead, as I mentioned in a previous comment. The Huffman tree size has an upper bound determined by the maximum number of quantization bins, but this overhead is not negligible when the original data size (e.g., 256x384x384 in this case) is not very big. That said, if the original data size were 1024^3, the impact of the Huffman tree overhead might not be that big.

Best,
Sheng

lindstro commented on June 9, 2024

@disheng222 Thanks for your suggestions. We decided to go with SZ2 because it supports a pointwise relative error bound. AFAIK, SZ3 does not.

Your Miranda plots look much like the "default" curve I included and less like the S-shaped curve I got after increasing the number of bins.

Do you know why there's such a large jump in rate for the more random data?

disheng222 commented on June 9, 2024

SZ3 also supports point-wise relative error bounds, but we didn't release this function in API. We can do it soon, and will let you know when it's ready.

lindstro commented on June 9, 2024

SZ3 also supports point-wise relative error bounds, but we didn't release this function in API. We can do it soon, and will let you know when it's ready.

@disheng222 Thanks--that would be great. If you don't mind, can we leave this issue open until you have added support to SZ3? I may have some follow-up questions once that's working.

disheng222 commented on June 9, 2024
