
Comments (8)

alimanfoo commented on July 17, 2024

That's a neat trick. However, unless I'm missing something (very possible),
I think that simply chunking an array will produce a very similar effect.
I.e., if you have an array with shape (100, 100) and you set the chunk
shape to (10, 10), then a request to retrieve data for the range [5:15,
5:15] will only require accessing 4 chunks. A

On Wednesday, 27 July 2016, Stephan Hoyer wrote:

> Apparently, this can result in significant IO savings when indexing
> multi-dimensional arrays.
>
> See http://bl.ocks.org/jaredwinick/5073432 (note that the graphic is
> interactive)



alimanfoo commented on July 17, 2024

Sorry, I accidentally sent the last comment incomplete. Basically, I think
that chunking gives you data locality when taking multidimensional slices of
an array, depending on the chosen chunk shape (see the sketch below). So it's
not obvious to me how z-order could be used with chunked arrays to improve
I/O.

Cc @benjeffery
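
To make the locality point concrete, here is a minimal sketch (the function is mine, not from the thread) that computes which chunks a rectangular selection touches:

```python
from itertools import product

def chunks_touched(selection, chunk_shape):
    """Return grid coordinates of every chunk overlapped by a selection.

    selection: tuple of (start, stop) pairs, one per dimension.
    chunk_shape: tuple of chunk sizes, one per dimension.
    """
    ranges = [
        range(start // size, (stop - 1) // size + 1)
        for (start, stop), size in zip(selection, chunk_shape)
    ]
    return list(product(*ranges))

# The example above: (100, 100) array, (10, 10) chunks, range [5:15, 5:15].
# The selection straddles one chunk boundary per dimension, so 4 chunks:
print(chunks_touched(((5, 15), (5, 15)), (10, 10)))
# [(0, 0), (0, 1), (1, 0), (1, 1)]
```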


shoyer commented on July 17, 2024

This would be for storing data within chunks, assuming support for partially reading chunks (possible with many backends). Given the fixed overhead for accessing each individual chunk, there is a minimum chunk size below which it doesn't make sense to chunk any further (depending on details, perhaps somewhere in the range of 1e3 to 1e6 elements). Z-order could yield significant benefits in that case.
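
For context, the z-order (Morton) layout under discussion interleaves coordinate bits so that points near each other in N-D space tend to be near each other in the 1-D chunk buffer. A minimal 2-D sketch (illustrative only; not zarr code):

```python
def morton_encode_2d(x, y, nbits=16):
    """Interleave the bits of x and y to get the Morton (z-order) index."""
    z = 0
    for i in range(nbits):
        z |= ((x >> i) & 1) << (2 * i)       # bits of x at even positions
        z |= ((y >> i) & 1) << (2 * i + 1)   # bits of y at odd positions
    return z

# Row-major order visits a 4x4 grid as 0..15; z-order visits it in a
# recursive "Z" pattern, keeping 2x2 blocks contiguous in the buffer:
print([morton_encode_2d(x, y) for y in range(4) for x in range(4)])
# [0, 1, 4, 5, 2, 3, 6, 7, 8, 9, 12, 13, 10, 11, 14, 15]
```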


alimanfoo commented on July 17, 2024

Ah OK. So what's the main use case? Taking a multidimensional slice where
the requested region is entirely within a single chunk?

Just so I'm completely with you, could you expand a bit more on what you
mean by the second sentence ("Given the overhead for accessing each
due...")?



shoyer commented on July 17, 2024

> Ah OK. So what's the main use case? Taking a multidimensional slice where
> the requested region is entirely within a single chunk?

A more common use case might be any slicing operation that is poorly aligned with the chunking, even if it involves multiple chunks. For example, suppose I have an array with chunks of shape (100, 100, 100), and I want to index out a single point along the first two dimensions, e.g., x[i, j, :]. Instead of needing to read the full contents of each chunk touched, I could read out much less data on average.
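
Rough numbers for that example (illustrative assumptions: 8-byte items, a C-order layout within the chunk, and a backend that can read sub-ranges of a chunk):

```python
# Illustrative cost model for x[i, j, :] with (100, 100, 100) chunks.
itemsize = 8  # bytes per element, e.g. float64 (an assumption)

full_chunk_bytes = itemsize * 100 * 100 * 100  # 8,000,000 bytes per chunk
# x[i, j, :] needs only one run of 100 consecutive elements per chunk:
partial_bytes = itemsize * 100                 # 800 bytes per chunk

print(full_chunk_bytes // partial_bytes)       # 10000x less data read
```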

I don't have a direct use case for this right now (since I'm not actually using zarr yet), but I can see lots of examples where this sort of thing might be useful.

> Just so I'm completely with you, could you expand a bit more on what you
> mean by the second sentence ("Given the overhead for accessing each due...")?

Oops, I had a typo (fixed now). What I was referring to is that there is some fixed overhead associated with storing and manipulating each chunk in every storage and task scheduling system, which is why we don't chunk arrays into single values.


alimanfoo commented on July 17, 2024

Thanks Stephan. Just to add that this relates to how Zarr might make use of the blosc_getitem function to extract partial contents of a chunk. I know bcolz uses blosc_getitem, but there things are simpler because it only deals with 1D arrays; so far I haven't had the brain space to figure out how to use it for multidimensional arrays.


shoyer commented on July 17, 2024

blosc_getitem would indeed be a good fit for this. We would need some method for converting a tuple of range selections into start and nitems arguments, and then loading the appropriate items into the output array.
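
A minimal sketch of that conversion, assuming a C-order layout within the chunk (the function name is hypothetical, and a real implementation would also handle slice steps and bounds checking):

```python
import itertools
import numpy as np

def getitem_calls(selection, chunk_shape):
    """Translate a tuple of slices over a C-order chunk into (start, nitems)
    pairs for a blosc_getitem-style partial decode.

    Each run along the contiguous last axis becomes one call.
    """
    *outer, last = selection
    nitems = last.stop - last.start
    # Element strides of a C-order layout, largest first:
    strides = np.cumprod((1,) + chunk_shape[:0:-1])[::-1]
    calls = []
    for idx in itertools.product(*(range(s.start, s.stop) for s in outer)):
        start = int(sum(i * n for i, n in zip(idx, strides[:-1]))) + last.start
        calls.append((start, nitems))
    return calls

# Selecting [5:7, 0:3] from a (10, 10) chunk needs two runs of 3 items:
print(getitem_calls((slice(5, 7), slice(0, 3)), (10, 10)))
# [(50, 3), (60, 3)]
```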


alimanfoo commented on July 17, 2024

I don't have bandwidth to explore this myself, but I'm very happy to discuss further if there is interest and someone else has time. FWIW I think a natural starting point would be to add support for a codec operation analogous to blosc_getitem, i.e., decoding items from some sub-range of a chunk. This would need to be generalised and added as a method on the Codec class. The Codec abstract base class could provide a default implementation which falls back to decoding and then slicing the whole chunk, for codecs that cannot provide an efficient implementation (see the sketch below). How this would work if there were filters (i.e., multiple codecs) on an array, I am uncertain. Once this infrastructure was in place, I guess a Zorder codec class could be implemented as a pre-compression filter, providing the z-order transformation and mapping getitem calls down to the next level? Or maybe it's not that simple; I can't see the whole architecture immediately.
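
A very rough sketch of that idea (simplified stand-ins, not actual zarr or numcodecs code; decode_partial is a hypothetical method name):

```python
class Codec:
    """Simplified stand-in for the Codec abstract base class."""

    def encode(self, buf):
        raise NotImplementedError

    def decode(self, buf):
        raise NotImplementedError

    def decode_partial(self, buf, start, nitems):
        """Decode nitems items starting at item index start.

        Default fallback for codecs with no efficient partial decode:
        decode the whole chunk, then slice. A Blosc codec would override
        this to call blosc_getitem; a hypothetical Zorder filter would
        remap (start, nitems) requests down to the next codec in the chain.
        """
        return self.decode(buf)[start:start + nitems]
```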

Btw, the next release of Zarr will have all code for compression and filter codecs moved out into a new numcodecs package, obtained via a dependency, so this issue may live more naturally there as a codec issue.

One other note: I think that although z-order might speed up access for small regions of an array, it would incur a cost for reading larger regions, because all read and write operations would need to pass through the z-order transformation.

