Comments (3)
The behavior you are seeing is exactly what I would expect and both phenomena you describe are a result of Blosc and the compression algorithms it uses under the hood.
Regarding the first example, i.e. zeros and arange vs. randint: this is a result of how the so called Lempel-Ziv type compression algorithms -- or dictionary encoders -- work. Intuitively, you can imagine that the compression algorithm searches for recurring subsequences within a large sequence. It then replaces any subsequences it has seen before with references to the first occurrence of this sequence, and these references are much shorter than the subsequence itself. This is effectively a way to reduce redundancies in the representation of the data and has the effect of compressing the data by transforming it into a representation that takes up much less space.
There is a complexity measure called Lempel-Ziv complexity which measures how well a sequence can be compressed.
When you do an arange of > 255 the values will wrap around again, i.e. you get a long sequence that always runs from 0-255 and then begins at 0 again. I.e. it contains many exactly equal subsequences and so you can get an amazing compression ratio. The same also applies to ones or zeros where you may get even better compression. Basically these sequences have really low Lempel-Ziv complexity.
Now, if you look at the random, or pseudo-random numbers, they are designed to contain as few redundancies as possible, i.e. random data is the kind of data that can not be compressed. There are no recurring subsequences. And that is the reason why your randint will not compress, it has maximum Lempel-Ziv complexity.
Incidentally I should also mention that an arange of type uint8 from 0 to 255 only will not have any compression ratio also. It is a historic example in one of the Lempel-Ziv papers, you should try it out.
Now the second thing you asked about, is why when choosing a different datatype for random integers in the range of 0-255 you suddenly get compression ratio. This is a result of the shuffle filter
in Blosc. It basically reorganizes multi-byte elements in such a way that the bytes are ordered by significance within a block. This is actually quite simple but a bit tricky to explain, so we have the following diagram:
http://blosc.org/images/shuffle.png
This shuffle filter basically takes all the first bytes from each element in a block and groups them together at the beginning, then it takes the second byte from each element and places them after all the first ones. This proceeds until it reaches the final bytes of each element which are then grouped together at the end of the block.
When casting your random integers from uint8 to (u)int64 you effectively add a bunch of zero bytes to each element, for example, 255 as uint8 is simply a byte with all ones. As uint64 it is 7 zero bytes and a single ones byte. So finally, what you are seeing is the shuffle filter in action, all the added zero bytes of the (u)int64 are grouped together and as explained earlier these all compress really well.
Also, you may want to look at the example in:
http://slides.zetatech.org/haenel-ep14-compress-me-stupid.pdf
On slide 13 and beyond i have illustrated how the shuffle filter filter works on a simple example that is actually quite related to yours.
The take home message I guess, is that the compression ration depends on the complexity of your data.
Hope that helps?
from bcolz.
This answer is unbelievably comprehensive and I thank you for it. It completely answers my query. I had in fact begun to wonder that this was some of the ways the compression worked because I installed blosc and began testing it on different structures. It makes total sense and has really helped. For my purposes it highlights that bcolz is an immense tool for storing sparse arrays to disc with super fast retrieval times. Super large dense arrays perhaps not from a compression point of view but it still allows arrays larger than memory to exist. All in all a super project.
from bcolz.
@handloomweaver great to hear that it helped!
I'd also like to point you additional resources, there is also a Blosc mailinglist:
https://groups.google.com/forum/#!forum/blosc
And a bcolz mailinglist:
https://groups.google.com/forum/#!forum/bcolz
from bcolz.
Related Issues (20)
- requirements.txt has numpy version >= 1.17 even for older bcolz
- Failed to install 1.2.1 by pip on Dell Optiplex 5050 («error: conflicting types for ‘_xgetbv’»). Probably problem in compiling if "avx2" in cpu_info["flags"] HOT 8
- Is bcolz works on mac os Catalina? HOT 6
- What is the purpose of validate_names()?
- Deprecation warning due to invalid escape sequences.
- get exception when import bcolz when my git is return non-zero exit.
- undefined reference to `blosc_internal_xgetbv'
- Compile error on shuffle.c (mingw) HOT 2
- bcolz.zeros cannot handle structured data type (ValueError: too many values to unpack (expected 2))
- Installation issue mac HOT 3
- error: command 'x86_64-linux-gnu-gcc' failed with exit status 1 HOT 1
- bcolz\carray_ext.c(45964): error C2039: 'tp_print': is not a member of '_typeobject'
- Is it possible to append value to the end of each column?
- how to share ctable object in share memory with multi process HOT 2
- shit 这**软件 装不上
- cant install bcolz library HOT 1
- how to get my carray data back afterr flush? HOT 2
- Cannot install bcolz for python 3.8 & python 3.9 HOT 8
- Improvements to C code that can improve performance HOT 1
- macos Ventura 13.1 Python 3.7 pip install bcolz error
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from bcolz.