Comments (17)

disheng222 commented on June 9, 2024

Hi Peter,
For difficult-to-compress cases (or fairly low error bounds), I suggest you increase the number of quantization bins. To this end, you can modify sz.config as follows:
max_quant_intervals = 67108864
(The default value was 65536)
Then, use the following command to compress the dataset again:
sz -c sz.config -z -d -i chaotic_data.dat -M ABS -A tolerance -3 256 256 256
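If you want to script this change rather than edit the file by hand, one possible way (a sketch; it assumes the max_quant_intervals line already exists uncommented in sz.config, uses GNU sed, and takes 1E-5 as an example tolerance):

sed -i 's/^max_quant_intervals.*/max_quant_intervals = 67108864/' sz.config
sz -c sz.config -z -d -i chaotic_data.dat -M ABS -A 1E-5 -3 256 256 256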

I tested that double-type dataset (its values are very random as I observed, so the compression ratio is low).

ABS     bit-rate   PSNR
2^-14   13.9        83.043196
2^-15   15.3        89.058258
2^-16   16.5        95.080285
2^-17   17.9       101.100309
2^-18   22.4       113.142963
2^-19   26.3       119.162523
2^-20   32.6       125.183684

I think this result makes more sense.
Increasing max_quant_intervals can mitigate that issue. As you know, max_quant_intervals is stored as an integer type, so there is an upper limit for this setting. Then, once the error bound is small enough, the bit rate will jump again.
Hope this answer is helpful to your research.

Best,
Sheng

lindstro commented on June 9, 2024

Sheng, thanks for the suggestion. As someone who has not played around with SZ parameter settings much, is there a way to algorithmically choose the number of quantization bins, or is this basically a trial and error process that one has to go through for each data set?

As evidenced by the table you listed, the rate increase per halving of the tolerance starts climbing again for the last few tolerances (ideally, the rate would not increase by more than one bit per halving), which suggests to me that one would then have to increase the number of quantization bins even further. Is there a limit on the number of bins? Should it be a power of two?

I imagine that increasing the number of bins may adversely affect the compression ratio for large tolerances, i.e., there may be no single setting that works well for a wide range of tolerances. Otherwise, would there not be a better default setting?

Finally, is there a way to set this parameter without using an sz.config file? It is a bit cumbersome to have to generate such a text file when using the SZ command-line utility in shell scripts.

robertu94 commented on June 9, 2024

I'll leave the rest to Sheng.

Finally, is there a way to set this parameter without using an sz.config file? It is a bit cumbersome to have to generate such a text file when using the SZ command-line utility in shell scripts.

@lindstro Not with the sz command line, but the libpressio command line from my libpressio-tools Spack package provides a way to do that for SZ:

git clone https://github.com/robertu94/spack_packages robertu94_packages
spack repo add ./robertu94_packages
spack install libpressio-tools ^ libpressio+sz+zfp
# if your compiler is older (pre-C++17) you might need this instead
spack install libpressio-tools ^ libpressio+sz+zfp ^ libstdcompat+boost
spack load libpressio-tools

# doesn't matter here, but dims are in fortran order for all compressors; last 3 args print metrics
pressio -i ./chaotic_data.dat -t double -d 256 -d 256 -d 256 -b compressor=sz -o sz:max_quant_intervals=67108864 -o pressio:abs=1e-14 -m time -m size -M all

#print help for a compressor
pressio -b compressor=sz -a help
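To sweep several error bounds from a shell script, the same invocation can be looped; a sketch reusing the flags above, with 2^-14, 2^-16, and 2^-20 written out as decimal tolerances:

for tol in 6.103515625e-05 1.52587890625e-05 9.5367431640625e-07; do
  pressio -i ./chaotic_data.dat -t double -d 256 -d 256 -d 256 \
    -b compressor=sz -o sz:max_quant_intervals=67108864 \
    -o pressio:abs=$tol -m time -m size -M all
done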

lindstro commented on June 9, 2024

@robertu94 Thanks for pointing that out. That is very convenient and another reason why I need to take a closer look at libpressio.

disheng222 commented on June 9, 2024

Hi Peter,
As for the sharp increase of the bit rate at an accuracy of 2^-20, that's because of the Huffman tree overhead.
I checked the content of the compressed data. Basically, when the accuracy is 2^-12 to 2^-16, the major part is the Huffman-encoded bytes and the Huffman tree overhead is tiny. However, when the accuracy is 2^-20, the Huffman tree itself is about 1.5X as large as the Huffman-encoded bytes. In fact, in SZ, we didn't try our best to store the Huffman tree as densely as possible, because SZ is mainly focused on lossy compression use cases, and we didn't target cases with extremely low error bounds such as 2^-20. That is, if the error bound is fairly low, or if the original data file is very small (e.g., around 1 MB), the Huffman tree overhead is not negligible, and we should study how to minimize such overhead.
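For a sense of scale, the -q output shown later in this thread (at ABS = 1E-5) reports huffmanTreeSize 7792474 against huffmanCodingSize 36967490, i.e., the tree is already roughly 20% of the encoded payload at that moderate bound; as the bound shrinks, the number of distinct quantization codes (and hence the tree) grows much faster than the encoded payload, which is how the tree ends up about 1.5X the payload at 2^-20.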

To check the detailed components of the compressed data in SZ, you can do the following:
./configure --enable-writestats
make

Then, when you run the 'sz' command, you can add the option -q to print the stats, which include more information such as the Huffman tree size and the number of encoded bytes.
BTW, you can use -p to print the actual number of quantization bins used in the compression: e.g., sz -p -s chaotic_data.dat.sz
if "actual used # intervals" doesn't reach 67108864, that means the quantization bin count is large enough to cover all the predicted values. BTW, the 'predThreshold' in sz.config is better to be set to 0.999 or larger value in order to make sure more data points can be covered by the quantization bins. All the above parameters are supported by Libpressio.

Best,
Sheng

robertu94 commented on June 9, 2024

@lindstro to elaborate on this

BTW, you can use -p to print the actual number of quantization bins used in the compression: e.g., sz -p -s chaotic_data.dat.sz

LibPressio's CLI is capable of outputting this information as well. It will automatically detect that you compiled sz+stats and add these metrics to the output above. The entire install command would then be:

spack install libpressio-tools ^ libpressio+sz+zfp ^ sz+stats

lindstro commented on June 9, 2024

In fact, in SZ, we didn't try our best to store the Huffman tree as densely as possible, because SZ is mainly focused on lossy compression use cases, and we didn't target cases with extremely low error bounds such as 2^-20.

Thanks for the explanation. I would argue that a tolerance of 2^-20 ≈ 10^-6 for data in (0, 1) is not all that low; it provides less accuracy than single precision. But it is good to know that this limitation is not unique to this data set, even if such difficult-to-compress data may require a larger number of quantization bins.

Just to make sure I fully understand, using 2^n bins allows SZ to encode residuals in roughly n bits (depending on entropy, of course). But for very large residuals (or small n) outside the range of available bins, SZ may have to record the residual as a full floating-point value, which generally is more expensive. And for random data, SZ is likely to give poor predictions that result in frequent, large residuals. Do I have that right?

To check the detailed components of the compressed data in SZ, you can do the following: ./configure --enable-writestats; make

That seems handy. Is there an analogous setting for CMake builds? cmake -L does not reveal anything obvious.

BTW, you can use -p to print the actual number of quantization bins used in the compression: e.g., sz -p -s chaotic_data.dat.sz. If "actual used # intervals" doesn't reach 67108864, that means the quantization bin count is large enough to cover all the predicted values.

I see. In this case, should one rerun with the number of bins reported?

robertu94 commented on June 9, 2024

Leaving the rest to @disheng222

That seems handy. Is there an analogous setting for CMake builds? cmake -L does not reveal anything obvious.

@lindstro the corresponding setting is BUILD_STATS:BOOL=ON

That is what libpressio uses when it builds with sz+stats.
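For example, something like the following should work (a sketch; /path/to/SZ stands in for the SZ source directory, and an out-of-source build is assumed):

cmake -DBUILD_STATS:BOOL=ON /path/to/SZ
make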

lindstro commented on June 9, 2024

@lindstro the corresponding setting is BUILD_STATS:BOOL=ON

Doh! Not sure how I missed that. Thanks.

disheng222 commented on June 9, 2024

@lindstro
"I would argue that a tolerance of 2-20 ≈ 10-6 for data in (0, 1) is not all that low"
Yes. But the chaotic dataset you are using here is essentially a random data, so the prediction in SZ works very inefficiently. This is why we need to use a larger number of quantization bins to cover it. This situation is not expected by SZ. :-)
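If anyone wants a quick stand-in for such data, the following sketch writes a similarly incompressible 256x256x256 array of random doubles in (0, 1); random_like.dat is a made-up name, and this is not the original chaotic dataset, just something with the same shape, type, and randomness:

python3 -c "import numpy as np; np.random.rand(256, 256, 256).tofile('random_like.dat')"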

"Just to make sure I fully understand, using 2^n bins allows SZ to encode residuals in roughly n bits (depending on entropy, of course). ......."
This understanding is basically correct. Please kindly note that SZ significantly depends on the smoothness of the data (because it uses a prediction method and encodes the residuals). If by 'entropy' here you mean the entropy of the residuals, then yes, the compression results depend on it (rather than the entropy of the original data).

You can use -q or -p or both in your compression operation as follows:

[sdi@localhost example]$ sz -p -q -z -d -c sz.config -i chaotic_data.dat -M ABS -A 1E-5 -3 256 256 256
===============stats about sz================
Constant data? : NO
use_mean: YES
blockSize 6
lorenzoPercent 0.955958
regressionPercent 0.044042
lorenzoBlocks 70825
regressionBlocks 3263
totalBlocks 74088
huffmanTreeSize 7792474
huffmanCodingSize 36967490
huffmanCompressionRatio 1.499306
huffmanNodeCount 599421
unpredictCount 0
unpredictPercent 0.000000
compression time = 1.063208
compressed data file: chaotic_data.dat.sz
=================SZ Compression Meta Data=================
Version: 2.1.12
Constant data?: NO
Lossless?: NO
Size type (size of # elements): 8 bytes
Num of elements: 16777216
compressor Name: SZ
Data type: DOUBLE
min value of raw data: 0.000000
max value of raw data: 1.000000
quantization_intervals: 0
max_quant_intervals: 67108864
actual used # intervals: 524288
dataEndianType (prior raw data): LITTLE_ENDIAN
sysEndianType (at compression): LITTLE_ENDIAN
sampleDistance: 100
predThreshold: 0.999000
szMode: SZ_BEST_COMPRESSION (with Zstd or Gzip)
gzipMode: Z_BEST_SPEED
errBoundMode: ABS
absErrBound: 0.000010
[sdi@localhost example]$

The -p metadata is stored in the compressed data, so you can also use -p to inspect a compressed file as follows:

[sdi@localhost example]$ sz -p -s chaotic_data.dat.sz
=================SZ Compression Meta Data=================
Version: 2.1.12
Constant data?: NO
Lossless?: NO
Size type (size of # elements): 8 bytes
Num of elements: 16777216
compressor Name: SZ
Data type: DOUBLE
min value of raw data: 0.000000
max value of raw data: 1.000000
quantization_intervals: 0
max_quant_intervals: 67108864
actual used # intervals: 524288
dataEndianType (prior raw data): LITTLE_ENDIAN
sysEndianType (at compression): LITTLE_ENDIAN
sampleDistance: 100
predThreshold: 0.999000
szMode: SZ_BEST_COMPRESSION (with Zstd or Gzip)
gzipMode: Z_BEST_SPEED
errBoundMode: ABS
absErrBound: 0.000010

lindstro commented on June 9, 2024

@disheng222 I tried your proposed fix of using 2^26 quantization bins. This does improve things a little for this type of random data at high rates (low tolerances). There's a huge jump in rate, from about 32 to 60, when halving the error tolerance. That's a bit surprising.

[figure: sz]

I'm also including the results of this change in the number of bins in the case of compressible data, namely the Miranda viscosity field from SDRBench. It doesn't seem that using more bins helps at high rates here. In fact, SZ does worse when increasing the number of bins. I'm just curious if there are other parameters that might help.

[figures: sz, viscosity]

disheng222 commented on June 9, 2024

I tested SZ2 (github.com/szcompressor/SZ) and SZ3 (github.com/szcompressor/SZ3) using Miranda's viscocity.d64.
I got the following results for the accuracy gain, which I computed in the following way:
α = log₂(σ / E) - R, where σ is the standard deviation of the original dataset, E is the root mean squared error, and R is the bit rate (w.r.t. 64 bits). Hope this calculation is correct.
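As a hypothetical worked example of that definition: with σ = 0.29, E = 10^-6, and R = 16 bits/value, α = log₂(0.29 / 10^-6) - 16 ≈ 18.1 - 16 ≈ 2.1.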
Then, I got the following results:
[screenshot: Screenshot from 2022-07-23 00-03-16]
I am using fixed quantization bins (65536) to run SZ2 and SZ3.
It seems that the accuracy gain I got is slightly different from your result.
Another observation is that SZ3 is better than SZ2.

The drop in SZ's accuracy gain is likely because of the drop in its compression ratio when the error bound is very small. One can increase the maximum number of quantization bins via the configuration file, but this also increases the Huffman tree size, which may offset the compression ratio improvement. How to store the Huffman tree more effectively was not a focus of our previous design, because we didn't consider such small error bounds before; I just used a naive implementation to store the Huffman tree. Improving the Huffman tree storage could increase the accuracy gain to a certain extent, but that has not been done yet.
In fact, we recently developed a version (called QoZ) that is better than SZ3: it gets better quality for high-error-bound cases but not for low-error-bound cases. It hasn't been integrated into the SZ3 GitHub repo yet.

disheng222 commented on June 9, 2024

I tested SZ3 using a maximum of 1048576 quantization bins, which can be set in sz3.config.
Then, I got the following results:

[screenshot: Screenshot from 2022-07-23 10-04-17]

The result (i.e., SZ3_1m) looks better than both SZ2 and SZ3 (default).
If I use a larger number of quantization bins, the result gets worse, probably because of the Huffman tree storage overhead, as I mentioned in a previous comment. The Huffman tree size has an upper bound determined by the maximum number of quantization bins, but this overhead is not negligible when the original data size (e.g., 256x384x384 in this case) is not very big. That said, if the original data size were 1024^3, the impact of the Huffman tree overhead might not be that big.

Best,
Sheng

lindstro commented on June 9, 2024

@disheng222 Thanks for your suggestions. We decided to go with SZ2 because it supports a pointwise relative error bound. AFAIK, SZ3 does not.

Your Miranda plots look much like the "default" curve I included and less like the S-shaped curve I got after increasing the number of bins.

Do you know why there's such a large jump in rate for the more random data?

disheng222 commented on June 9, 2024

SZ3 also supports point-wise relative error bounds, but we didn't release this function in API. We can do it soon, and will let you know when it's ready.

lindstro commented on June 9, 2024

SZ3 also supports point-wise relative error bounds, but we didn't release this function in API. We can do it soon, and will let you know when it's ready.

@disheng222 Thanks--that would be great. If you don't mind, can we leave this issue open until you have added support to SZ3? I may have some follow-up questions once that's working.

disheng222 commented on June 9, 2024
