The zarr-python's discuss from zarr-developers

Detect AVX2

Detect AVX2 support within setup.py and enable it when compiling.

open... functions work with any store

The functions open_array and open_group could accept a store as argument, providing mode semantics for opening any store.

Also the open function could be made more flexible to return group or array depending on what is found.

Generalize to other storage systems

I want something very similar to zarr on S3 and I'm pondering the easiest way to get there. One approach is to generalize zarr to accept pluggable byte storage solutions.

Currently, I believe that zarr effectively treats the file system as a MutableMapping into which it can deposit and retrieve bytes. If this is the case then what are your thoughts on actually using the MutableMapping interface instead of touching files directly? That way I could provide MutableMappings that use file systems, zip files, s3, hdfs, etc.. This nicely isolates a lot of the "where do I put this block of bytes" logic from the array slicing and compression logic.

For concreteness, here is a MutableMapping that loads/stores data in a directory on the file system. https://github.com/mrocklin/zict/blob/master/zict/file.py

Use numcodecs

The codecs have been factored out into a new package https://github.com/alimanfoo/numcodecs. This means the zarr.codecs module could be removed and replaced by adding numcodecs as a runtime dependency.

Add path argument to array creation functions

Could be useful to add a path argument to the bare array creation functions. Open question then about whether ancestor groups should be created.

Add empty_like, zeros_like, ones_like, full_like

Overwrite when creating array in group

It would be useful to be able to create an array in a group with an option to overwrite any existing array with the given name if present.

Proposed to add an "overwrite=False" keyword argument to all creation methods on the Group class.

Fall-back to pure Python installation

Consider rewriting setup.py so that it tries to compile the blosc cython extension, but if that fails, continues on with a pure Python installation, similar to how simplejson installs.

ZipStore performance

Currently ZipStore performance is spectacularly poor because a ZipFile is re-opened on every __getitem__ call, causing the zip contents to be read multiple times. To have any reasonable performance we need to rewrite along the lines of zict to open the ZipFile once and have that re-used for each __getitem__ call. N.B., this will also mean having to implement flush() and/or context manager protocol for writing to a ZipStore.

RunLength codec

It would be possible to implement a simple run length codec, e.g., making use of https://gist.github.com/nvictus/66627b580c13068589957d6ab0919e66. This would very likely not offer any better compression than proper compressors, but might be interesting to try out.

Default compression settings

I have noticed performance increases in other projects when I choose default compression settings based on dtype.

Optimal compression settings depend strongly on bit patterns. Data types often strongly indicate bit pattern characteristics. For example integers often benefit more from compression than floats. Datetimes are often nearly sorted and so benefit more from shuffle.

It might improve performance to change the compression defaults in defaults.py to come from a function that takes the dtype as an input.

TempStore

Would be useful to have a TempStore extending DirectoryStore but using a temporary directory.

Bitshuffle single byte dtype

Upgrade c-blosc to enable bitshuffle with single byte dtypes.

PY2 compatibility

Add support for Python 2.7.

Group object support item deletion

Useful to delete sub-group or array from group via del statement.

Scale-offset filter

Consider implementing a scale-offset filter similar to HDF5, at least for floating point data.

Migrate v1 to v2

Add function to migrate array metadata from v1 to v2.

Array creation from data, chunks is None

When creating an array via array(), if user does not provide chunks, zarr checks data for chunks, but does not deal with situation where data.chunks is None, so generates error when tries to take len().

Use JSON instead of pickle for array metadata

JSON is much more portable, and leaves open a possibility of writing zarr libraries in other programming languages.

Related: #5

Upgrade c-blosc to 1.10.0

Upgrade c-blosc.

Array.nchunks property

Add a property with total number of chunks for the array. Makes it easier to check if all chunks initialized.

Hierarchical storage

As originally suggested here, add a zarr.hdf module which provides a hierarchical storage system for managing Zarr arrays.

Codec eq

It would be convenient at least for testing if the codec classes implemented __eq__.

Quantize filter

Consider implementing quantize filter as per bcolz.

Consider support for storing chunks in z-order

Apparently, this can result in significant IO savings when indexing multi-dimensional arrays.

See http://bl.ocks.org/jaredwinick/5073432 (note that the graphic is interactive)

Blosc extension factor out common buffer access code

Factor out common code to obtain buffer pointer with PY2 compatibility for array.array.

Group list members optimization when scanning keys

When a store does not implement listdir and keys need to be scanned, there is a possible optimization for listing members of a group because child arrays and groups could be discovered within the key scan, rather than requiring additional __contains__ tests.

F (Fortran) order

It should be possible to support the use of F (Fortran) order to organised data within each chunk, as well as the current default C order. This may improve data compression in some cases, depending on the autocorrelation structure within an array.

Upgrade c-blosc to 1.10.2

Describe the zarr file format in the documentation

It looks like the design is very similar to bcolz, but this would be nice to have as a point of reference.

Source links in docs

Get source links working in docs.

Default fill value zero where possible

It would be better for the fill_value to be zero than None in array creation functions such as array and create. With no fill value, chunks overhanging the edgy of an array get filled with random memory, which may be very poorly compressible.

Handle both compressor and compression arguments given

Give the user some feedback if they accidentally provide both compressor and compression_opts kwargs. Either one overrides other and issue warning, or raise ValueError.

v2.0 release

Release actions:

Optionally use blosc in non-contextual mode

Add an option to use blosc in the multi-threaded non-contextual (i.e., global blocking) mode, which is better when using zarr in a single-threaded environment because it allows blosc to use multiple threads internally.

Detect main thread and adapt Blosc usage automatically

Currently if you switch between using Zarr in the main thread (e.g., making Zarr API calls directly in an IPython session) and using from multi-threaded context (e.g., if you use a Zarr array as part of a Dask computation), you have to manually switch the way Blosc is used by calling zarr.blosc.use_context(True) and zarr.blosc.use_context(False). This is cumbersome for interactive analysis. This could be avoided if the Blosc extension checked if the current thread is the main thread, and if so use Blosc in non-contextual mode, otherwise use Blosc in contextual mode, so the user doesn't have to do any manual switching.

Other compressor codecs

It would be possible to implement Zstd, LZ4 and Snappy codecs that make use of these compressors directly, not via Blosc. Also implementing a Zlib codec directly on C code rather than via Python stdlib would probably be faster. Source code for these is already present within the c-blosc submodule. Personally I would always go via Blosc so not strongly motivated to do this myself, but keeping this as placeholder.

Delta filter

Consider implementing delta filter.

Checksum filters

Could add codecs to check for data corruption. CRC32 and Adler32 could be implemented via zlib module from Python standard library. HDF5 uses Fletcher32, not sure where implementation could be available from.

Reduce memory copies with no compression

Review scenarios where no compression is requested, either via compression=None or via Blosc with clevel=0. Are there any opportunities to avoid unnecessary memory copies?

Blosc extension use bytes

Currently the blosc extension uses array.array for memory allocation and to minimize buffer copies. This is also possible using bytes, via PyBytes_FromStringAndSize(NULL, nbytes), PyBytes_AS_STRING and Py_SIZE, an example is the python-zstd extension.

Returning bytes would be marginally better for compatibility, e.g., the HDFS mapping implementation can only handle bytes, and so needs to copy array to bytes if given array.

zarr-developers / zarr-python Goto Github PK

zarr-python's Issues

Recommend Projects

Recommend Topics

Recommend Org