
Comments (3)

esc commented on August 16, 2024

The behavior you are seeing is exactly what I would expect and both phenomena you describe are a result of Blosc and the compression algorithms it uses under the hood.

Regarding the first example, i.e. zeros and arange vs. randint: this is a result of how the so-called Lempel-Ziv-type compression algorithms -- or dictionary encoders -- work. Intuitively, you can imagine that the compression algorithm searches for recurring subsequences within a large sequence. It then replaces any subsequence it has seen before with a reference to its first occurrence, and that reference is much shorter than the subsequence itself. This reduces redundancy in the representation of the data and has the effect of compressing it into a form that takes up much less space.

There is a complexity measure called Lempel-Ziv complexity which measures how well a sequence can be compressed.

When you do an arange of more than 256 values as uint8, the values wrap around, i.e. you get a long sequence that repeatedly runs from 0 to 255 and then starts at 0 again. It therefore contains many exactly equal subsequences, so you get an amazing compression ratio. The same applies to ones or zeros, where you may get even better compression. Basically, these sequences have really low Lempel-Ziv complexity.

Now, if you look at random, or pseudo-random, numbers, they are designed to contain as few redundancies as possible, i.e. random data is exactly the kind of data that cannot be compressed. There are no recurring subsequences, and that is why your randint will not compress: it has maximum Lempel-Ziv complexity.

Incidentally, I should also mention that an arange of type uint8 running from 0 to 255 only will not compress at all either, since it never repeats a subsequence. It is a historic example from one of the Lempel-Ziv papers; you should try it out.
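If you want to see this for yourself, here is a minimal sketch (the array sizes are just illustrative, and it assumes NumPy and the python-blosc bindings are installed) that compresses the sequences discussed above with Blosc directly; the exact numbers will vary with codec and compression level:

```python
import numpy as np
import blosc

n = 1_000_000

candidates = {
    "zeros":          np.zeros(n, dtype=np.uint8),
    "arange (wraps)": np.arange(n).astype(np.uint8),       # 0..255 repeated over and over
    "arange 0..255":  np.arange(256, dtype=np.uint8),       # no repeated subsequences
    "randint":        np.random.randint(0, 256, n).astype(np.uint8),
}

for name, arr in candidates.items():
    packed = blosc.compress(arr.tobytes(), typesize=arr.itemsize)
    ratio = arr.nbytes / len(packed)
    print(f"{name:15s} {arr.nbytes:>8d} -> {len(packed):>8d} bytes (ratio ~{ratio:.1f}x)")
```

The low-complexity sequences (zeros and the wrapping arange) should shrink dramatically, while the single 0..255 run and the random bytes should stay roughly the same size.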

Now, the second thing you asked about is why, when choosing a different datatype for random integers in the range 0-255, you suddenly get a good compression ratio. This is a result of the shuffle filter in Blosc. It basically reorganizes multi-byte elements in such a way that the bytes are ordered by significance within a block. This is actually quite simple but a bit tricky to explain in words, so we have the following diagram:

http://blosc.org/images/shuffle.png

This shuffle filter basically takes all the first bytes from each element in a block and groups them together at the beginning, then it takes the second byte of each element and places those after all the first ones. This proceeds until it reaches the final bytes of each element, which are then grouped together at the end of the block.
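Here is a tiny NumPy-only sketch of that byte regrouping (an illustration of the idea, not Blosc's actual block-oriented C implementation):

```python
import numpy as np

block = np.array([1, 2, 3, 4], dtype=np.uint32)           # 4 elements, 4 bytes each
as_bytes = block.view(np.uint8).reshape(len(block), 4)     # one row of bytes per element
shuffled = as_bytes.T.ravel()                              # all 1st bytes, then all 2nd bytes, ...
print(as_bytes)   # [[1 0 0 0] [2 0 0 0] [3 0 0 0] [4 0 0 0]] (little-endian)
print(shuffled)   # [1 2 3 4 0 0 0 0 0 0 0 0 0 0 0 0] -- the zero bytes now form one long run
```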

When casting your random integers from uint8 to (u)int64 you effectively add a bunch of zero bytes to each element. For example, 255 as uint8 is simply a byte with all ones; as uint64 it is seven zero bytes and a single all-ones byte. So what you are seeing is the shuffle filter in action: all the added zero bytes of the (u)int64 elements are grouped together, and as explained earlier, those long runs compress really well.
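The following sketch (same assumptions as above: NumPy plus python-blosc, with illustrative sizes) shows that effect by compressing the same 0-255 values stored as uint64, once with the shuffle filter switched off and once with it on:

```python
import numpy as np
import blosc

vals = np.random.randint(0, 256, 1_000_000).astype(np.uint64)  # 7 zero bytes per element

for shuffle in (blosc.NOSHUFFLE, blosc.SHUFFLE):
    packed = blosc.compress(vals.tobytes(), typesize=vals.itemsize, shuffle=shuffle)
    label = "shuffle on " if shuffle == blosc.SHUFFLE else "shuffle off"
    print(f"{label}: {vals.nbytes} -> {len(packed)} bytes")
```

With shuffle off, the zero bytes stay interleaved with the random bytes; with shuffle on, they end up in long runs, which typically gives a noticeably better ratio.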

Also, you may want to look at the example in:

http://slides.zetatech.org/haenel-ep14-compress-me-stupid.pdf

On slide 13 and beyond I have illustrated how the shuffle filter works on a simple example that is actually quite related to yours.

The take-home message, I guess, is that the compression ratio depends on the complexity of your data.

Hope that helps?


handloomweaver commented on August 16, 2024

This answer is unbelievably comprehensive and I thank you for it. It completely answers my query. I had in fact begun to suspect that this was how the compression worked, because I installed blosc and began testing it on different structures. It makes total sense and has really helped. For my purposes it highlights that bcolz is an immense tool for storing sparse arrays to disk with super fast retrieval times. For super large dense arrays perhaps not from a compression point of view, but it still allows arrays larger than memory to exist. All in all a super project.


esc commented on August 16, 2024

@handloomweaver great to hear that it helped!

I'd also like to point you to some additional resources. There is a Blosc mailing list:

https://groups.google.com/forum/#!forum/blosc

And a bcolz mailing list:

https://groups.google.com/forum/#!forum/bcolz
