
zarr-developers / zarr-python

An implementation of chunked, compressed, N-dimensional arrays for Python.

Home Page: http://zarr.readthedocs.io/

License: MIT License

Python 99.23% Roff 0.77%
hacktoberfest zarr ndimensional-arrays compressed python

zarr-python's People

Contributors: alimanfoo, alt-shivam, andrewfulton9, carreau, clbarnes, d-v-b, dependabot[bot], dimitripapadopoulos, don-bran, dstansby, grlee77, jakirkham, jhamman, jni, joshmoore, jrbourbeau, madsbk, martindurant, maxrjones, mrocklin, msankeys963, mzjp2, normanrz, pre-commit-ci[bot], qulogic, rabernat, raphaeldussin, saransh-cpp, shikharsg, vincentschut


zarr-python's Issues

Checksum filters

Codecs could be added to check for data corruption. CRC32 and Adler32 could be implemented via the zlib module from the Python standard library. HDF5 uses Fletcher32; I am not sure where an implementation could be sourced from.
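A minimal sketch of what a CRC32 checksum codec could look like, assuming a numcodecs-style encode/decode interface (the class name and interface are illustrative, not an existing zarr API):

```python
import struct
import zlib


class CRC32Checksum:
    """Hypothetical codec sketch: append a CRC32 digest on encode,
    verify and strip it on decode."""

    def encode(self, buf: bytes) -> bytes:
        # zlib.crc32 is in the stdlib; mask to keep it a 32-bit unsigned value
        crc = zlib.crc32(buf) & 0xFFFFFFFF
        return buf + struct.pack("<I", crc)

    def decode(self, buf: bytes) -> bytes:
        data, stored = buf[:-4], struct.unpack("<I", buf[-4:])[0]
        if zlib.crc32(data) & 0xFFFFFFFF != stored:
            raise ValueError("checksum mismatch: data corruption detected")
        return data
```

Appending the 4-byte digest keeps corruption detection a cheap suffix check; an Adler32 variant would be identical apart from the hash function.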

Array creation from data, chunks is None

When creating an array via array(), if the user does not provide chunks, zarr inspects data for a chunks attribute, but it does not handle the case where data.chunks is None, and so raises an error when it tries to take len().
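A defensive fallback could look roughly like this (the function and stub names are hypothetical, for illustration only):

```python
def guess_chunks_from_data(data, chunks=None):
    """Sketch of a defensive chunk guess: fall back to data.shape when
    data.chunks is absent or None (one chunk covering the whole array)."""
    if chunks is None:
        chunks = getattr(data, "chunks", None)
    if chunks is None:
        chunks = getattr(data, "shape", None)
    return chunks
```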

Detect AVX2

Detect AVX2 support within setup.py and enable it when compiling.
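One hedged approach, shown for Linux only, is to scan /proc/cpuinfo for the avx2 flag; a robust setup.py check would likely need cpuid intrinsics or a probe compilation instead:

```python
import platform


def cpu_supports_avx2() -> bool:
    """Best-effort AVX2 detection sketch (Linux only): look for the
    'avx2' flag in /proc/cpuinfo. Returns False on other platforms."""
    if platform.system() != "Linux":
        return False
    try:
        with open("/proc/cpuinfo") as f:
            return "avx2" in f.read()
    except OSError:
        return False
```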

Other compressor codecs

It would be possible to implement Zstd, LZ4 and Snappy codecs that use these compressors directly, rather than via Blosc. Implementing a Zlib codec directly in C rather than via the Python stdlib would probably also be faster. Source code for these compressors is already present within the c-blosc submodule. Personally I would always go via Blosc, so I am not strongly motivated to do this myself, but I am keeping this as a placeholder.

Blosc extension use bytes

Currently the Blosc extension uses array.array for memory allocation and to minimize buffer copies. The same is possible using bytes, via PyBytes_FromStringAndSize(NULL, nbytes), PyBytes_AS_STRING and Py_SIZE; the python-zstd extension is an example.

Returning bytes would be marginally better for compatibility, e.g., the HDFS mapping implementation can only handle bytes, and so has to copy an array.array to bytes when given one.

v2.0 release

Release actions:

  • merge filters PR
  • deal with remaining issues
  • run tests
  • git tag
  • pypi release
  • git release
  • conda-forge release
  • upload windows wheels to pypi

Overwrite when creating array in group

It would be useful to be able to create an array in a group with an option to overwrite any existing array with the given name if present.

The proposal is to add an overwrite=False keyword argument to all creation methods on the Group class.
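The intended semantics might be sketched like this, using a plain dict as a stand-in for a Group (create_array_in_group is a hypothetical name):

```python
def create_array_in_group(group: dict, name: str, value, overwrite: bool = False):
    """Sketch of overwrite semantics: refuse to clobber an existing
    member unless overwrite=True is passed explicitly."""
    if name in group and not overwrite:
        raise ValueError(f"array {name!r} already exists")
    group[name] = value
    return value
```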

F (Fortran) order

It should be possible to support F (Fortran) order for organising data within each chunk, as well as the current default C order. This may improve data compression in some cases, depending on the autocorrelation structure within an array.
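To see why order matters, the same array serialises to different byte streams under C and F order, and it is the byte stream that the compressor sees (numpy-only sketch):

```python
import numpy as np

# The same 2x3 array serialises differently depending on memory order,
# which can change how well the resulting bytes compress.
a = np.arange(6, dtype="i4").reshape(2, 3)
c_bytes = a.tobytes(order="C")  # row-major: 0 1 2 3 4 5
f_bytes = a.tobytes(order="F")  # column-major: 0 3 1 4 2 5
```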

Optionally use blosc in non-contextual mode

Add an option to use blosc in the multi-threaded non-contextual (i.e., global blocking) mode, which is better when using zarr in a single-threaded environment because it allows blosc to use multiple threads internally.

Group list members optimization when scanning keys

When a store does not implement listdir and keys need to be scanned, there is a possible optimization for listing members of a group because child arrays and groups could be discovered within the key scan, rather than requiring additional __contains__ tests.
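The optimization could be sketched as a single pass over the keys (list_members is a hypothetical helper; the key layout assumed here is "prefix/child/..."):

```python
def list_members(keys, prefix):
    """Sketch: derive a group's child names from one scan over all store
    keys, instead of issuing a separate __contains__ probe per candidate."""
    members = set()
    for key in keys:
        if key.startswith(prefix):
            rest = key[len(prefix):]
            # first path segment after the prefix is the member name
            members.add(rest.split("/", 1)[0])
    return sorted(members)
```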

open... functions work with any store

The functions open_array and open_group could accept a store as an argument, providing mode semantics for opening any store.

The open function could also be made more flexible, returning a group or an array depending on what is found.

Reduce memory copies with no compression

Review scenarios where no compression is requested, either via compression=None or via Blosc with clevel=0. Are there any opportunities to avoid unnecessary memory copies?

Array.nchunks property

Add a property giving the total number of chunks for the array. This makes it easier to check whether all chunks have been initialized.
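The computation is a one-liner over the chunk grid; a standalone sketch (nchunks as a free function rather than the proposed property):

```python
import math


def nchunks(shape, chunks):
    """Sketch of an nchunks property: the chunk count along each
    dimension is ceil(shape / chunk), and the total is their product."""
    return math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))
```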

Default fill value zero where possible

It would be better for the fill_value to default to zero rather than None in array creation functions such as array and create. With no fill value, chunks overhanging the edge of an array get filled with random memory, which may compress very poorly.

Scale-offset filter

Consider implementing a scale-offset filter similar to HDF5, at least for floating point data.
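A rough sketch of a fixed scale-offset transform for floats, loosely modeled on HDF5's filter (the class name and parameters are illustrative, and the round-trip is lossy by design):

```python
import numpy as np


class FixedScaleOffset:
    """Hypothetical scale-offset sketch: store round((x - offset) * scale)
    as integers, which typically compress far better than raw floats."""

    def __init__(self, offset: float, scale: float):
        self.offset = offset
        self.scale = scale

    def encode(self, arr):
        # quantise to integers; precision is limited to 1/scale
        return np.round((arr - self.offset) * self.scale).astype("i8")

    def decode(self, enc):
        return enc / self.scale + self.offset
```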

Default compression settings

I have noticed performance increases in other projects when I choose default compression settings based on dtype.

Optimal compression settings depend strongly on bit patterns, and data types often strongly indicate bit pattern characteristics. For example, integers often benefit more from compression than floats, and datetimes are often nearly sorted and so benefit more from shuffle.

It might improve performance to change the compression defaults in defaults.py to come from a function that takes the dtype as an input.
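A sketch of what such a dtype-keyed defaults function might look like; the specific settings below are illustrative guesses, not zarr's actual defaults:

```python
import numpy as np


def default_compression_for(dtype):
    """Illustrative sketch: choose compressor settings from the dtype's
    kind code (these particular choices are assumptions, not benchmarks)."""
    kind = np.dtype(dtype).kind
    if kind in "Mm":   # datetime/timedelta: often nearly sorted, shuffle helps
        return {"cname": "zstd", "clevel": 5, "shuffle": True}
    if kind in "iu":   # integers: usually compress well
        return {"cname": "zstd", "clevel": 5, "shuffle": True}
    if kind == "f":    # floats: noisy low bits, spend less effort
        return {"cname": "lz4", "clevel": 3, "shuffle": False}
    return {"cname": "lz4", "clevel": 1, "shuffle": False}
```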

ZipStore performance

Currently ZipStore performance is spectacularly poor because a ZipFile is re-opened on every __getitem__ call, causing the zip contents to be read multiple times. To have any reasonable performance we need to rewrite along the lines of zict to open the ZipFile once and have that re-used for each __getitem__ call. N.B., this will also mean having to implement flush() and/or context manager protocol for writing to a ZipStore.
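The proposed fix could be sketched as a mapping that opens the ZipFile once and reuses it; the write path is stubbed out because, as noted, it would need flush() or a context manager (the class name is hypothetical):

```python
import zipfile
from collections.abc import MutableMapping


class OpenOnceZipStore(MutableMapping):
    """Sketch: keep one ZipFile handle for the store's lifetime instead
    of reopening the archive on every __getitem__. Read path only."""

    def __init__(self, path):
        self._zf = zipfile.ZipFile(path)

    def __getitem__(self, key):
        with self._zf.open(key) as f:
            return f.read()

    def __setitem__(self, key, value):
        raise NotImplementedError("write path needs flush()/with-block support")

    def __delitem__(self, key):
        raise NotImplementedError

    def __iter__(self):
        return iter(self._zf.namelist())

    def __len__(self):
        return len(self._zf.namelist())

    def close(self):
        self._zf.close()
```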

Detect main thread and adapt Blosc usage automatically

Currently, if you switch between using Zarr in the main thread (e.g., making Zarr API calls directly in an IPython session) and using it from a multi-threaded context (e.g., a Zarr array used as part of a Dask computation), you have to manually switch how Blosc is used by calling zarr.blosc.use_context(True) or zarr.blosc.use_context(False). This is cumbersome for interactive analysis. It could be avoided if the Blosc extension checked whether the current thread is the main thread: if so, use Blosc in non-contextual mode; otherwise, use contextual mode. The user would then not have to do any manual switching.
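The main-thread check itself is straightforward with the threading module; wiring its result into zarr.blosc.use_context is the part this sketch leaves out (the helper name is hypothetical):

```python
import threading


def blosc_use_context() -> bool:
    """Sketch of the proposed auto-switch: contextual mode everywhere
    except the main thread, where non-contextual (multi-threaded Blosc)
    is preferable."""
    return threading.current_thread() is not threading.main_thread()
```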

None compressor

Currently there is a NoneCompressor class which provides no compression. This could be removed if the Array class explicitly handles the case where the compressor is None, resulting in some code simplification.

TempStore

Would be useful to have a TempStore extending DirectoryStore but using a temporary directory.
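A sketch of the idea: create a fresh temporary directory and register cleanup at interpreter exit. A real TempStore would subclass DirectoryStore and inherit its key/value methods, which are omitted here:

```python
import atexit
import shutil
import tempfile


class TempStore:
    """Sketch: a DirectoryStore-like store rooted at a temporary
    directory that is removed when the interpreter exits."""

    def __init__(self, suffix=".zarr"):
        self.path = tempfile.mkdtemp(suffix=suffix)
        atexit.register(shutil.rmtree, self.path, ignore_errors=True)
```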

Codec __eq__

It would be convenient at least for testing if the codec classes implemented __eq__.
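A sketch of value-equality for codecs: same type plus same configuration, with __hash__ kept consistent (Codec and Zlib here are illustrative stand-ins, not zarr's actual classes):

```python
class Codec:
    """Sketch of an equality mixin: two codecs compare equal when they
    are the same type configured with the same attributes."""

    def __eq__(self, other):
        return type(self) is type(other) and self.__dict__ == other.__dict__

    def __hash__(self):
        # keep codecs usable in sets/dicts, consistent with __eq__
        return hash((type(self), tuple(sorted(self.__dict__.items()))))


class Zlib(Codec):
    def __init__(self, level=1):
        self.level = level
```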

Hierarchical storage

As originally suggested here, add a zarr.hdf module which provides a hierarchical storage system for managing Zarr arrays.

Generalize to other storage systems

I want something very similar to zarr on S3 and I'm pondering the easiest way to get there. One approach is to generalize zarr to accept pluggable byte storage solutions.

Currently, I believe that zarr effectively treats the file system as a MutableMapping into which it can deposit and retrieve bytes. If this is the case, then what are your thoughts on actually using the MutableMapping interface instead of touching files directly? That way I could provide MutableMappings backed by file systems, zip files, S3, HDFS, etc. This nicely isolates the "where do I put this block of bytes" logic from the array slicing and compression logic.

For concreteness, here is a MutableMapping that loads/stores data in a directory on the file system. https://github.com/mrocklin/zict/blob/master/zict/file.py
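A minimal sketch in the same spirit as the linked zict file mapping (names here are illustrative): a MutableMapping that stores each value as a file in a directory, so the array code only ever sees the mapping interface:

```python
import os
from collections.abc import MutableMapping


class FileMapping(MutableMapping):
    """Sketch: each key becomes a file in `directory`; swapping this for
    an S3- or HDFS-backed mapping would not change the array code."""

    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)

    def _path(self, key):
        return os.path.join(self.directory, key)

    def __setitem__(self, key, value):
        with open(self._path(key), "wb") as f:
            f.write(value)

    def __getitem__(self, key):
        try:
            with open(self._path(key), "rb") as f:
                return f.read()
        except FileNotFoundError:
            raise KeyError(key)

    def __delitem__(self, key):
        try:
            os.remove(self._path(key))
        except FileNotFoundError:
            raise KeyError(key)

    def __iter__(self):
        return iter(os.listdir(self.directory))

    def __len__(self):
        return len(os.listdir(self.directory))
```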
