zarr-developers / zarr-python
An implementation of chunked, compressed, N-dimensional arrays for Python.
Home Page: http://zarr.readthedocs.io/
License: MIT License
Codecs could be added to check for data corruption. CRC32 and Adler32 could be implemented via the zlib module from the Python standard library. HDF5 uses Fletcher32; not sure where an implementation of that could be sourced.
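For illustration, a minimal sketch of what a CRC32 codec could look like using only the stdlib; the class and the encode/decode method names follow the numcodecs convention but are otherwise assumptions:

```python
import struct
import zlib


class CRC32(object):
    # Sketch of a checksum codec: prepend a CRC32 of the payload on
    # encode; verify and strip it on decode.

    def encode(self, buf):
        buf = bytes(buf)
        checksum = zlib.crc32(buf) & 0xffffffff
        return struct.pack('<I', checksum) + buf

    def decode(self, buf):
        buf = bytes(buf)
        stored = struct.unpack('<I', buf[:4])[0]
        payload = buf[4:]
        if (zlib.crc32(payload) & 0xffffffff) != stored:
            raise RuntimeError('CRC32 checksum verification failed')
        return payload
```

An Adler32 variant would be identical with zlib.adler32 swapped in; Fletcher32 would need a third-party or hand-rolled implementation.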
JSON is much more portable, and leaves open the possibility of writing zarr libraries in other programming languages.
Related: #5
...and any other documentation TODOs.
Add support for Python 2.7.
Consider rewriting setup.py so that it tries to compile the blosc Cython extension but, if that fails, continues with a pure Python installation, similar to how simplejson installs.
Add support for persistent arrays.
It could be useful to add a path argument to the bare array creation functions. An open question then is whether ancestor groups should be created.
Apparently, this can result in significant IO savings when indexing multi-dimensional arrays.
See http://bl.ocks.org/jaredwinick/5073432 (note that the graphic is interactive)
When creating an array via array(), if the user does not provide chunks, zarr checks the data for a chunks attribute, but does not handle the situation where data.chunks is None, and so raises an error when it tries to take len() of it.
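A minimal sketch of the missing guard; the helper name is hypothetical:

```python
def _chunks_from_data(data):
    # Treat a chunks attribute of None (e.g., an h5py dataset with
    # contiguous storage) the same as no chunks attribute at all.
    chunks = getattr(data, 'chunks', None)
    if chunks is not None:
        return tuple(chunks)
    return None  # caller falls back to automatic chunk guessing
```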
Detect AVX2 support within setup.py and enable it when compiling.
It would be possible to implement Zstd, LZ4 and Snappy codecs that use these compressors directly, rather than via Blosc. Implementing a Zlib codec directly in C rather than via the Python stdlib would probably also be faster. Source code for these is already present within the c-blosc submodule. Personally I would always go via Blosc, so I'm not strongly motivated to do this myself, but keeping this as a placeholder.
It would be useful to be able to delete a sub-group or array from a group via the del statement.
Currently the blosc extension uses array.array for memory allocation and to minimize buffer copies. This is also possible using bytes, via PyBytes_FromStringAndSize(NULL, nbytes), PyBytes_AS_STRING and Py_SIZE; an example is the python-zstd extension.
Returning bytes would be marginally better for compatibility; e.g., the HDFS mapping implementation can only handle bytes, and so needs to copy the array to bytes if given an array.
Release actions:
It would be useful to be able to create an array in a group with an option to overwrite any existing array with the given name if present.
Proposed: add an overwrite=False keyword argument to all creation methods on the Group class.
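A sketch of how the proposed keyword might be used; the overwrite argument is the proposal, not existing API:

```python
import zarr

g = zarr.group()
g.zeros('foo', shape=(100, 100), chunks=(10, 10))
# Proposed: replace the existing array rather than raising an error.
g.zeros('foo', shape=(200, 200), chunks=(20, 20), overwrite=True)
```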
Add function to migrate array metadata from v1 to v2.
It should be possible to support the use of F (Fortran) order to organise data within each chunk, as well as the current default C order. This may improve data compression in some cases, depending on the autocorrelation structure within an array.
Add an option to use blosc in the multi-threaded non-contextual (i.e., global blocking) mode, which is better when using zarr in a single-threaded environment because it allows blosc to use multiple threads internally.
The codecs have been factored out into a new package https://github.com/alimanfoo/numcodecs. This means the zarr.codecs module could be removed and replaced by adding numcodecs as a runtime dependency.
It looks like the design is very similar to bcolz, but this would be nice to have as a point of reference.
When a store does not implement listdir and keys need to be scanned, there is a possible optimization for listing members of a group, because child arrays and groups could be discovered within the key scan rather than requiring additional __contains__ tests.
Make fasteners a conditional import.
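A minimal sketch of the conditional import pattern, with a guard at the point of use:

```python
try:
    import fasteners
except ImportError:  # pragma: no cover
    fasteners = None


def _require_fasteners():
    # Fail only when inter-process locking is actually requested.
    if fasteners is None:
        raise ImportError('fasteners is required for inter-process '
                          'synchronization; please install it')
```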
Skip the zict tests if zict is not installed, to make the conda-forge setup easier.
The functions open_array and open_group could accept a store as an argument, providing mode semantics for opening any store. The open function could also be made more flexible, returning a group or an array depending on what is found.
Add support for appending along axes other than 0.
Review scenarios where no compression is requested, either via compression=None or via Blosc with clevel=0. Are there any opportunities to avoid unnecessary memory copies?
Add a property giving the total number of chunks for the array. This makes it easier to check whether all chunks are initialized.
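A minimal sketch of such a property, assuming the array exposes shape and chunks; the names are illustrative:

```python
from functools import reduce
import operator


class ChunkCountMixin(object):

    @property
    def cdata_shape(self):
        # Number of chunks along each dimension (ceiling division).
        return tuple(-(-s // c) for s, c in zip(self.shape, self.chunks))

    @property
    def nchunks(self):
        # Total number of chunks in the array.
        return reduce(operator.mul, self.cdata_shape, 1)
```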
Consider implementing delta filter.
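A sketch of the idea for 1-D data; differencing often makes smoothly varying data far more compressible:

```python
import numpy as np


def delta_encode(arr):
    # Store the first element followed by successive differences.
    arr = np.asarray(arr)
    enc = np.empty_like(arr)
    enc[0] = arr[0]
    enc[1:] = np.diff(arr)
    return enc


def delta_decode(enc):
    # Cumulative sum exactly inverts the differencing for integers.
    return np.cumsum(enc)
```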
It would be better for the fill_value to be zero than None in array creation functions such as array and create. With no fill value, chunks overhanging the edge of an array get filled with random memory, which may be very poorly compressible.
Consider implementing a scale-offset filter similar to HDF5, at least for floating point data.
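One possible formulation, shown as standalone functions; it is lossy, with precision limited to 1/scale:

```python
import numpy as np


def scaleoffset_encode(data, offset, scale):
    # Quantize floats to integers: shift, scale, round.
    return np.round((np.asarray(data) - offset) * scale).astype('<i8')


def scaleoffset_decode(enc, offset, scale):
    # Invert the transform; anything below 1/scale is lost.
    return enc.astype('<f8') / scale + offset
```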
Upgrade c-blosc to enable bitshuffle with single byte dtypes.
I have noticed performance increases in other projects when I choose default compression settings based on dtype.
Optimal compression settings depend strongly on bit patterns. Data types often strongly indicate bit pattern characteristics. For example integers often benefit more from compression than floats. Datetimes are often nearly sorted and so benefit more from shuffle.
It might improve performance to change the compression defaults in defaults.py to come from a function that takes the dtype as an input.
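A sketch of what such a function might look like; the specific choices below are illustrative, not measured recommendations:

```python
import numpy as np


def default_compression_opts(dtype):
    dtype = np.dtype(dtype)
    if dtype.kind in 'Mm':
        # Datetimes/timedeltas are often nearly sorted: favour bit-shuffle.
        return dict(cname='lz4', clevel=5, shuffle=2)
    if dtype.kind in 'iu':
        # Integers usually compress well with byte-shuffle.
        return dict(cname='lz4', clevel=5, shuffle=1)
    # Floats and everything else: conservative defaults.
    return dict(cname='lz4', clevel=5, shuffle=1)
```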
Consider increasing the default Blosc configuration to allow Blosc to use up to 8 cores if available.
Currently ZipStore performance is spectacularly poor because a ZipFile is re-opened on every __getitem__ call, causing the zip contents to be read multiple times. To get any reasonable performance we need to rewrite along the lines of zict, opening the ZipFile once and re-using it for each __getitem__ call. N.B., this will also mean having to implement flush() and/or the context manager protocol for writing to a ZipStore.
Get source links working in docs.
It would be possible to implement a simple run length codec, e.g., making use of https://gist.github.com/nvictus/66627b580c13068589957d6ab0919e66. This would very likely not offer any better compression than proper compressors, but might be interesting to try out.
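A sketch of the encode/decode core with numpy, for non-empty 1-D arrays:

```python
import numpy as np


def rle_encode(arr):
    # Positions where the value changes, plus the start of the array.
    arr = np.asarray(arr)
    starts = np.r_[0, np.flatnonzero(arr[1:] != arr[:-1]) + 1]
    lengths = np.diff(np.r_[starts, arr.size])
    return arr[starts], lengths


def rle_decode(values, lengths):
    # Expand each value by its run length.
    return np.repeat(values, lengths)
```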
Currently, if you switch between using Zarr in the main thread (e.g., making Zarr API calls directly in an IPython session) and using it from a multi-threaded context (e.g., if you use a Zarr array as part of a Dask computation), you have to manually switch the way Blosc is used by calling zarr.blosc.use_context(True) and zarr.blosc.use_context(False). This is cumbersome for interactive analysis. It could be avoided if the Blosc extension checked whether the current thread is the main thread, and if so used Blosc in non-contextual mode, otherwise in contextual mode, so the user doesn't have to do any manual switching.
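A sketch of the check; note that threading.main_thread() requires Python 3.4+, so a Python 2 fallback would be needed:

```python
import threading


def in_main_thread():
    # True when called from the interpreter's main thread.
    return threading.current_thread() is threading.main_thread()


# Hypothetical integration point: choose the Blosc mode automatically.
# use_context(not in_main_thread())
```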
Give the user some feedback if they accidentally provide both compressor and compression_opts kwargs: either let one override the other and issue a warning, or raise a ValueError.
Support setting of key/value attributes on arrays, including persistent arrays.
Link to the HDFSMap in API docs when it becomes available.
Consider implementing quantize filter as per bcolz.
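A sketch modelled on bcolz's quantize, which rounds to a number of significant digits so that the zeroed low bits compress much better:

```python
import math

import numpy as np


def quantize(data, digits):
    # Choose a power-of-two scale covering the requested precision.
    precision = 10.0 ** -digits
    exp = int(math.floor(math.log10(precision)))
    bits = int(math.ceil(math.log(10.0 ** -exp, 2)))
    scale = 2.0 ** bits
    return np.around(scale * np.asarray(data)) / scale
```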
Currently there is a NoneCompressor class which provides no compression. This could be removed if the Array class explicitly handles the case where the compressor is None, resulting in some code simplification.
It would be useful to have a TempStore extending DirectoryStore but using a temporary directory.
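A minimal sketch, assuming DirectoryStore lives in zarr.storage:

```python
import atexit
import shutil
import tempfile

from zarr.storage import DirectoryStore


class TempStore(DirectoryStore):

    def __init__(self, suffix='', prefix='zarr', dir=None):
        # Create a private temporary directory and clean it up at exit.
        path = tempfile.mkdtemp(suffix=suffix, prefix=prefix, dir=dir)
        atexit.register(shutil.rmtree, path, True)  # ignore_errors=True
        super(TempStore, self).__init__(path)
```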
Factor out the common code for obtaining a buffer pointer, with PY2 compatibility for array.array.
It would be convenient, at least for testing, if the codec classes implemented __eq__.
As originally suggested here, add a zarr.hdf module which provides a hierarchical storage system for managing Zarr arrays.
I want something very similar to zarr on S3 and I'm pondering the easiest way to get there. One approach is to generalize zarr to accept pluggable byte storage solutions.
Currently, I believe that zarr effectively treats the file system as a MutableMapping into which it can deposit and retrieve bytes. If this is the case, then what are your thoughts on actually using the MutableMapping interface instead of touching files directly? That way I could provide MutableMappings that use file systems, zip files, S3, HDFS, etc. This nicely isolates a lot of the "where do I put this block of bytes" logic from the array slicing and compression logic.
For concreteness, here is a MutableMapping that loads/stores data in a directory on the file system: https://github.com/mrocklin/zict/blob/master/zict/file.py
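For reference, a stripped-down version of that idea; key validation and sub-directory handling are omitted:

```python
import os
from collections.abc import MutableMapping


class FileStore(MutableMapping):
    # Minimal MutableMapping keeping each value in its own file.

    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)

    def _path(self, key):
        return os.path.join(self.directory, key)

    def __getitem__(self, key):
        try:
            with open(self._path(key), 'rb') as f:
                return f.read()
        except FileNotFoundError:
            raise KeyError(key)

    def __setitem__(self, key, value):
        with open(self._path(key), 'wb') as f:
            f.write(value)

    def __delitem__(self, key):
        try:
            os.remove(self._path(key))
        except FileNotFoundError:
            raise KeyError(key)

    def __iter__(self):
        return iter(os.listdir(self.directory))

    def __len__(self):
        return len(os.listdir(self.directory))
```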
Upgrade c-blosc.