
Comments (3)

esc commented on August 16, 2024

The behavior you are seeing is exactly what I would expect and both phenomena you describe are a result of Blosc and the compression algorithms it uses under the hood.

Regarding the first example, i.e. zeros and arange vs. randint: this is a result of how the so-called Lempel-Ziv-type compression algorithms -- or dictionary encoders -- work. Intuitively, you can imagine that the compression algorithm searches for recurring subsequences within a large sequence. It then replaces any subsequence it has seen before with a reference to its first occurrence, and that reference is much shorter than the subsequence itself. This reduces redundancy in the representation of the data and has the effect of compressing it into a form that takes up much less space.

There is a complexity measure called Lempel-Ziv complexity which measures how well a sequence can be compressed.

When you do an arange of more than 256 values as uint8, the values wrap around, i.e. you get a long sequence that repeatedly runs from 0 to 255 and then starts at 0 again. It therefore contains many exactly equal subsequences, so you get an amazing compression ratio. The same applies to ones or zeros, where you may get even better compression. Basically, these sequences have really low Lempel-Ziv complexity.

Now, if you look at random, or pseudo-random, numbers, they are designed to contain as few redundancies as possible, i.e. random data is exactly the kind of data that cannot be compressed. There are no recurring subsequences, and that is why your randint will not compress: it has maximum Lempel-Ziv complexity.

Incidentally, I should also mention that an arange of type uint8 running from 0 to 255 only will not compress at all either, since it never repeats a subsequence. It is a historic example from one of the Lempel-Ziv papers; you should try it out.
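If you want to see this for yourself, here is a minimal sketch (the array sizes are just illustrative, and it assumes NumPy and the python-blosc bindings are installed) that compresses the sequences discussed above with Blosc directly; the exact numbers will vary with codec and compression level:

```python
import numpy as np
import blosc

n = 1_000_000

candidates = {
    "zeros":          np.zeros(n, dtype=np.uint8),
    "arange (wraps)": np.arange(n).astype(np.uint8),       # 0..255 repeated over and over
    "arange 0..255":  np.arange(256, dtype=np.uint8),       # no repeated subsequences
    "randint":        np.random.randint(0, 256, n).astype(np.uint8),
}

for name, arr in candidates.items():
    packed = blosc.compress(arr.tobytes(), typesize=arr.itemsize)
    ratio = arr.nbytes / len(packed)
    print(f"{name:15s} {arr.nbytes:>8d} -> {len(packed):>8d} bytes (ratio ~{ratio:.1f}x)")
```

The low-complexity sequences (zeros and the wrapping arange) should shrink dramatically, while the single 0..255 run and the random bytes should stay roughly the same size.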

Now, the second thing you asked about is why, when choosing a different datatype for random integers in the range 0-255, you suddenly get a good compression ratio. This is a result of the shuffle filter in Blosc. It basically reorganizes multi-byte elements in such a way that the bytes are ordered by significance within a block. This is actually quite simple but a bit tricky to explain in words, so we have the following diagram:

http://blosc.org/images/shuffle.png

This shuffle filter basically takes all the first bytes from each element in a block and groups them together at the beginning, then it takes the second byte of each element and places those after all the first ones. This proceeds until it reaches the final bytes of each element, which are then grouped together at the end of the block.
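Here is a tiny NumPy-only sketch of that byte regrouping (an illustration of the idea, not Blosc's actual block-oriented C implementation):

```python
import numpy as np

block = np.array([1, 2, 3, 4], dtype=np.uint32)           # 4 elements, 4 bytes each
as_bytes = block.view(np.uint8).reshape(len(block), 4)     # one row of bytes per element
shuffled = as_bytes.T.ravel()                              # all 1st bytes, then all 2nd bytes, ...
print(as_bytes)   # [[1 0 0 0] [2 0 0 0] [3 0 0 0] [4 0 0 0]] (little-endian)
print(shuffled)   # [1 2 3 4 0 0 0 0 0 0 0 0 0 0 0 0] -- the zero bytes now form one long run
```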

When casting your random integers from uint8 to (u)int64 you effectively add a bunch of zero bytes to each element. For example, 255 as uint8 is simply a byte with all ones; as uint64 it is seven zero bytes and a single all-ones byte. So what you are seeing is the shuffle filter in action: all the added zero bytes of the (u)int64 elements are grouped together, and as explained earlier, those long runs compress really well.
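The following sketch (same assumptions as above: NumPy plus python-blosc, with illustrative sizes) shows that effect by compressing the same 0-255 values stored as uint64, once with the shuffle filter switched off and once with it on:

```python
import numpy as np
import blosc

vals = np.random.randint(0, 256, 1_000_000).astype(np.uint64)  # 7 zero bytes per element

for shuffle in (blosc.NOSHUFFLE, blosc.SHUFFLE):
    packed = blosc.compress(vals.tobytes(), typesize=vals.itemsize, shuffle=shuffle)
    label = "shuffle on " if shuffle == blosc.SHUFFLE else "shuffle off"
    print(f"{label}: {vals.nbytes} -> {len(packed)} bytes")
```

With shuffle off, the zero bytes stay interleaved with the random bytes; with shuffle on, they end up in long runs, which typically gives a noticeably better ratio.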

Also, you may want to look at the example in:

http://slides.zetatech.org/haenel-ep14-compress-me-stupid.pdf

On slide 13 and beyond I have illustrated how the shuffle filter works on a simple example that is actually quite related to yours.

The take-home message, I guess, is that the compression ratio depends on the complexity of your data.

Hope that helps?


handloomweaver commented on August 16, 2024

This answer is unbelievably comprehensive and I thank you for it. It completely answers my query. I had in fact begun to suspect that this was how the compression worked, because I installed blosc and began testing it on different structures. It makes total sense and has really helped. For my purposes it highlights that bcolz is an immense tool for storing sparse arrays to disk with super fast retrieval times. For super large dense arrays perhaps not from a compression point of view, but it still allows arrays larger than memory to exist. All in all a super project.


esc commented on August 16, 2024

@handloomweaver great to hear that it helped!

I'd also like to point you to some additional resources. There is a Blosc mailing list:

https://groups.google.com/forum/#!forum/blosc

And a bcolz mailing list:

https://groups.google.com/forum/#!forum/bcolz
