
Comments (8)

alimanfoo commented on July 17, 2024

That's a neat trick. However, unless I'm missing something (very possible),
I think that simply chunking an array will produce a very similar effect.
I.e., if you have an array with shape (100, 100) and you set the chunk
shape to (10, 10), then a request to retrieve data for the range [5:15,
5:15] will only require accessing 4 chunks. A

On Wednesday, 27 July 2016, Stephan Hoyer wrote:

> Apparently, this can result in significant IO savings when indexing
> multi-dimensional arrays.
>
> See http://bl.ocks.org/jaredwinick/5073432 (note that the graphic is
> interactive)



alimanfoo commented on July 17, 2024

Sorry, I accidentally sent the last comment incomplete. Basically, I think
that chunking gives you data locality when taking multidimensional slices of
an array, depending on the chosen chunk shape (see the sketch below). So it's
not obvious to me how z-order could be used with chunked arrays to improve
I/O.

Cc @benjeffery
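
To make the locality point concrete, here is a minimal sketch (the function is mine, not from the thread) that computes which chunks a rectangular selection touches:

```python
from itertools import product

def chunks_touched(selection, chunk_shape):
    """Return grid coordinates of every chunk overlapped by a selection.

    selection: tuple of (start, stop) pairs, one per dimension.
    chunk_shape: tuple of chunk sizes, one per dimension.
    """
    ranges = [
        range(start // size, (stop - 1) // size + 1)
        for (start, stop), size in zip(selection, chunk_shape)
    ]
    return list(product(*ranges))

# The example above: (100, 100) array, (10, 10) chunks, range [5:15, 5:15].
# The selection straddles one chunk boundary per dimension, so 4 chunks:
print(chunks_touched(((5, 15), (5, 15)), (10, 10)))
# [(0, 0), (0, 1), (1, 0), (1, 1)]
```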


shoyer commented on July 17, 2024

This would be for storing data within chunks, assuming support for partially reading chunks (possible with many backends). Given the fixed overhead for accessing each individual chunk, there is a minimum chunk size below which it doesn't make sense to chunk any further (depending on details, perhaps somewhere in the range of 1e3 to 1e6 elements). Z-order could yield significant benefits in that case.
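
For context, the z-order (Morton) layout under discussion interleaves coordinate bits so that points near each other in N-D space tend to be near each other in the 1-D chunk buffer. A minimal 2-D sketch (illustrative only; not zarr code):

```python
def morton_encode_2d(x, y, nbits=16):
    """Interleave the bits of x and y to get the Morton (z-order) index."""
    z = 0
    for i in range(nbits):
        z |= ((x >> i) & 1) << (2 * i)       # bits of x at even positions
        z |= ((y >> i) & 1) << (2 * i + 1)   # bits of y at odd positions
    return z

# Row-major order visits a 4x4 grid as 0..15; z-order visits it in a
# recursive "Z" pattern, keeping 2x2 blocks contiguous in the buffer:
print([morton_encode_2d(x, y) for y in range(4) for x in range(4)])
# [0, 1, 4, 5, 2, 3, 6, 7, 8, 9, 12, 13, 10, 11, 14, 15]
```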


alimanfoo commented on July 17, 2024

Ah OK. So what's the main use case? Taking a multidimensional slice where
the requested region is entirely within a single chunk?

Just so I'm completely with you, could you expand a bit more on what you
mean by the second sentence ("Given the overhead for accessing each
due...")?



shoyer commented on July 17, 2024

> Ah OK. So what's the main use case? Taking a multidimensional slice where
> the requested region is entirely within a single chunk?

A more common use case might be any slicing operation that is poorly aligned with the chunking, even if it involves multiple chunks. For example, suppose I have an array with chunks of shape (100, 100, 100), and I want to index out a single point along the first two dimensions, e.g., x[i, j, :]. Instead of needing to read the full contents of each chunk touched, I could read out much less data on average.
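
Rough numbers for that example (illustrative assumptions: 8-byte items, a C-order layout within the chunk, and a backend that can read sub-ranges of a chunk):

```python
# Illustrative cost model for x[i, j, :] with (100, 100, 100) chunks.
itemsize = 8  # bytes per element, e.g. float64 (an assumption)

full_chunk_bytes = itemsize * 100 * 100 * 100  # 8,000,000 bytes per chunk
# x[i, j, :] needs only one run of 100 consecutive elements per chunk:
partial_bytes = itemsize * 100                 # 800 bytes per chunk

print(full_chunk_bytes // partial_bytes)       # 10000x less data read
```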

I don't have a direct use case for this right now (since I'm not actually using zarr yet), but I can see lots of examples where this sort of thing might be useful.

> Just so I'm completely with you, could you expand a bit more on what you
> mean by the second sentence ("Given the overhead for accessing each due...")?

Oops, I had a typo (fixed now). What I was referring to is that there is some fixed overhead associated with storing and manipulating each chunk in every storage and task scheduling system, which is why we don't chunk arrays into single values.


alimanfoo commented on July 17, 2024

Thanks Stephan. Just to add that this relates to how Zarr might make use of the blosc_getitem function to extract partial contents of a chunk. I know bcolz uses blosc_getitem, but there things are simpler because it only deals with 1D arrays; so far I haven't had the brain space to figure out how to use it for multidimensional arrays.


shoyer commented on July 17, 2024

blosc_getitem would indeed be a good fit for this. We would need some method for converting a tuple of range selections into start and nitems arguments, and then loading the appropriate items into the output array.
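
A minimal sketch of that conversion, assuming a C-order layout within the chunk (the function name is hypothetical, and a real implementation would also handle slice steps and bounds checking):

```python
import itertools
import numpy as np

def getitem_calls(selection, chunk_shape):
    """Translate a tuple of slices over a C-order chunk into (start, nitems)
    pairs for a blosc_getitem-style partial decode.

    Each run along the contiguous last axis becomes one call.
    """
    *outer, last = selection
    nitems = last.stop - last.start
    # Element strides of a C-order layout, largest first:
    strides = np.cumprod((1,) + chunk_shape[:0:-1])[::-1]
    calls = []
    for idx in itertools.product(*(range(s.start, s.stop) for s in outer)):
        start = int(sum(i * n for i, n in zip(idx, strides[:-1]))) + last.start
        calls.append((start, nitems))
    return calls

# Selecting [5:7, 0:3] from a (10, 10) chunk needs two runs of 3 items:
print(getitem_calls((slice(5, 7), slice(0, 3)), (10, 10)))
# [(50, 3), (60, 3)]
```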


alimanfoo commented on July 17, 2024

I don't have bandwidth to explore this myself, but I'm very happy to discuss further if there is interest and someone else has time. FWIW I think a natural starting point would be to add support for a codec operation analogous to blosc_getitem, i.e., decoding items from some sub-range of a chunk. This would need to be generalised and added as a method on the Codec class. The Codec abstract base class could provide a default implementation which falls back to decoding and then slicing the whole chunk, for codecs that cannot provide an efficient implementation (see the sketch below). How this would work if there were filters (i.e., multiple codecs) on an array, I am uncertain. Once this infrastructure was in place, I guess a Zorder codec class could be implemented as a pre-compression filter, providing the z-order transformation and mapping getitem calls down to the next level? Or maybe it's not that simple; I can't see the whole architecture immediately.
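
A very rough sketch of that idea (simplified stand-ins, not actual zarr or numcodecs code; decode_partial is a hypothetical method name):

```python
class Codec:
    """Simplified stand-in for the Codec abstract base class."""

    def encode(self, buf):
        raise NotImplementedError

    def decode(self, buf):
        raise NotImplementedError

    def decode_partial(self, buf, start, nitems):
        """Decode nitems items starting at item index start.

        Default fallback for codecs with no efficient partial decode:
        decode the whole chunk, then slice. A Blosc codec would override
        this to call blosc_getitem; a hypothetical Zorder filter would
        remap (start, nitems) requests down to the next codec in the chain.
        """
        return self.decode(buf)[start:start + nitems]
```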

Btw, the next release of Zarr will have all code for compression and filter codecs moved out into a new numcodecs package, obtained via a dependency, so this issue may live more naturally there as a codec issue.

One other note: I think that although z-order might speed up access for small regions of an array, it would incur a cost for reading larger regions, because all read and write operations would need to pass through the z-order transformation.

