zarr-developers / zarr-python
An implementation of chunked, compressed, N-dimensional arrays for Python.
Home Page: http://zarr.readthedocs.io/
License: MIT License
Codecs could be added to check for data corruption. CRC32 and Adler32 could be implemented via the zlib module from the Python standard library. HDF5 uses Fletcher32; not sure where an implementation of that could be sourced.
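For illustration, a minimal sketch of what a CRC32 codec could look like using only the stdlib; the class and the encode/decode method names follow the numcodecs convention but are otherwise assumptions:

```python
import struct
import zlib


class CRC32(object):
    # Sketch of a checksum codec: prepend a CRC32 of the payload on
    # encode; verify and strip it on decode.

    def encode(self, buf):
        buf = bytes(buf)
        checksum = zlib.crc32(buf) & 0xffffffff
        return struct.pack('<I', checksum) + buf

    def decode(self, buf):
        buf = bytes(buf)
        stored = struct.unpack('<I', buf[:4])[0]
        payload = buf[4:]
        if (zlib.crc32(payload) & 0xffffffff) != stored:
            raise RuntimeError('CRC32 checksum verification failed')
        return payload
```

An Adler32 variant would be identical with zlib.adler32 swapped in; Fletcher32 would need a third-party or hand-rolled implementation.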
JSON is much more portable, and leaves open the possibility of writing zarr libraries in other programming languages.
Related: #5
...and any other documentation TODOs.
Add support for Python 2.7.
Consider rewriting setup.py so that it tries to compile the blosc Cython extension but, if that fails, continues with a pure Python installation, similar to how simplejson installs.
Add support for persistent arrays.
It could be useful to add a path argument to the bare array creation functions. An open question then is whether ancestor groups should be created.
Apparently, this can result in significant IO savings when indexing multi-dimensional arrays.
See http://bl.ocks.org/jaredwinick/5073432 (note that the graphic is interactive)
When creating an array via array(), if the user does not provide chunks, zarr checks the data for a chunks attribute, but does not handle the situation where data.chunks is None, and so raises an error when it tries to take len() of it.
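A minimal sketch of the missing guard; the helper name is hypothetical:

```python
def _chunks_from_data(data):
    # Treat a chunks attribute of None (e.g., an h5py dataset with
    # contiguous storage) the same as no chunks attribute at all.
    chunks = getattr(data, 'chunks', None)
    if chunks is not None:
        return tuple(chunks)
    return None  # caller falls back to automatic chunk guessing
```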
Detect AVX2 support within setup.py and enable it when compiling.
It would be possible to implement Zstd, LZ4 and Snappy codecs that use these compressors directly, rather than via Blosc. Implementing a Zlib codec directly in C rather than via the Python stdlib would probably also be faster. Source code for these is already present within the c-blosc submodule. Personally I would always go via Blosc, so I'm not strongly motivated to do this myself, but keeping this as a placeholder.
It would be useful to be able to delete a sub-group or array from a group via the del statement.
Currently the blosc extension uses array.array for memory allocation and to minimize buffer copies. This is also possible using bytes, via PyBytes_FromStringAndSize(NULL, nbytes), PyBytes_AS_STRING and Py_SIZE; an example is the python-zstd extension.
Returning bytes would be marginally better for compatibility; e.g., the HDFS mapping implementation can only handle bytes, and so needs to copy the array to bytes if given an array.
Release actions:
It would be useful to be able to create an array in a group with an option to overwrite any existing array with the given name if present.
Proposed: add an overwrite=False keyword argument to all creation methods on the Group class.
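A sketch of how the proposed keyword might be used; the overwrite argument is the proposal, not existing API:

```python
import zarr

g = zarr.group()
g.zeros('foo', shape=(100, 100), chunks=(10, 10))
# Proposed: replace the existing array rather than raising an error.
g.zeros('foo', shape=(200, 200), chunks=(20, 20), overwrite=True)
```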
Add function to migrate array metadata from v1 to v2.
It should be possible to support the use of F (Fortran) order to organise data within each chunk, as well as the current default C order. This may improve data compression in some cases, depending on the autocorrelation structure within an array.
Add an option to use blosc in the multi-threaded non-contextual (i.e., global blocking) mode, which is better when using zarr in a single-threaded environment because it allows blosc to use multiple threads internally.
The codecs have been factored out into a new package https://github.com/alimanfoo/numcodecs. This means the zarr.codecs module could be removed and replaced by adding numcodecs as a runtime dependency.
It looks like the design is very similar to bcolz, but this would be nice to have as a point of reference.
When a store does not implement listdir and keys need to be scanned, there is a possible optimization for listing members of a group, because child arrays and groups could be discovered within the key scan rather than requiring additional __contains__ tests.
Make fasteners a conditional import.
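A minimal sketch of the conditional import pattern, with a guard at the point of use:

```python
try:
    import fasteners
except ImportError:  # pragma: no cover
    fasteners = None


def _require_fasteners():
    # Fail only when inter-process locking is actually requested.
    if fasteners is None:
        raise ImportError('fasteners is required for inter-process '
                          'synchronization; please install it')
```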
Skip the zict tests if zict is not installed, to make the conda-forge setup easier.
The functions open_array and open_group could accept a store as an argument, providing mode semantics for opening any store. The open function could also be made more flexible, returning a group or an array depending on what is found.
Add support for appending along axes other than 0.
Review scenarios where no compression is requested, either via compression=None or via Blosc with clevel=0. Are there any opportunities to avoid unnecessary memory copies?
Add a property giving the total number of chunks for the array. This makes it easier to check whether all chunks are initialized.
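A minimal sketch of such a property, assuming the array exposes shape and chunks; the names are illustrative:

```python
from functools import reduce
import operator


class ChunkCountMixin(object):

    @property
    def cdata_shape(self):
        # Number of chunks along each dimension (ceiling division).
        return tuple(-(-s // c) for s, c in zip(self.shape, self.chunks))

    @property
    def nchunks(self):
        # Total number of chunks in the array.
        return reduce(operator.mul, self.cdata_shape, 1)
```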
Consider implementing delta filter.
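A sketch of the idea for 1-D data; differencing often makes smoothly varying data far more compressible:

```python
import numpy as np


def delta_encode(arr):
    # Store the first element followed by successive differences.
    arr = np.asarray(arr)
    enc = np.empty_like(arr)
    enc[0] = arr[0]
    enc[1:] = np.diff(arr)
    return enc


def delta_decode(enc):
    # Cumulative sum exactly inverts the differencing for integers.
    return np.cumsum(enc)
```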
It would be better for the fill_value to be zero than None in array creation functions such as array and create. With no fill value, chunks overhanging the edge of an array get filled with random memory, which may be very poorly compressible.
Consider implementing a scale-offset filter similar to HDF5, at least for floating point data.
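One possible formulation, shown as standalone functions; it is lossy, with precision limited to 1/scale:

```python
import numpy as np


def scaleoffset_encode(data, offset, scale):
    # Quantize floats to integers: shift, scale, round.
    return np.round((np.asarray(data) - offset) * scale).astype('<i8')


def scaleoffset_decode(enc, offset, scale):
    # Invert the transform; anything below 1/scale is lost.
    return enc.astype('<f8') / scale + offset
```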
Upgrade c-blosc to enable bitshuffle with single byte dtypes.
I have noticed performance increases in other projects when I choose default compression settings based on dtype.
Optimal compression settings depend strongly on bit patterns. Data types often strongly indicate bit pattern characteristics. For example integers often benefit more from compression than floats. Datetimes are often nearly sorted and so benefit more from shuffle.
It might improve performance to change the compression defaults in defaults.py to come from a function that takes the dtype as an input.
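A sketch of what such a function might look like; the specific choices below are illustrative, not measured recommendations:

```python
import numpy as np


def default_compression_opts(dtype):
    dtype = np.dtype(dtype)
    if dtype.kind in 'Mm':
        # Datetimes/timedeltas are often nearly sorted: favour bit-shuffle.
        return dict(cname='lz4', clevel=5, shuffle=2)
    if dtype.kind in 'iu':
        # Integers usually compress well with byte-shuffle.
        return dict(cname='lz4', clevel=5, shuffle=1)
    # Floats and everything else: conservative defaults.
    return dict(cname='lz4', clevel=5, shuffle=1)
```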
Consider increasing the default Blosc configuration to allow Blosc to use up to 8 cores if available.
Currently ZipStore performance is spectacularly poor because a ZipFile is re-opened on every __getitem__ call, causing the zip contents to be read multiple times. To get any reasonable performance we need to rewrite along the lines of zict, opening the ZipFile once and re-using it for each __getitem__ call. N.B., this will also mean having to implement flush() and/or the context manager protocol for writing to a ZipStore.
Get source links working in docs.
It would be possible to implement a simple run length codec, e.g., making use of https://gist.github.com/nvictus/66627b580c13068589957d6ab0919e66. This would very likely not offer any better compression than proper compressors, but might be interesting to try out.
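A sketch of the encode/decode core with numpy, for non-empty 1-D arrays:

```python
import numpy as np


def rle_encode(arr):
    # Positions where the value changes, plus the start of the array.
    arr = np.asarray(arr)
    starts = np.r_[0, np.flatnonzero(arr[1:] != arr[:-1]) + 1]
    lengths = np.diff(np.r_[starts, arr.size])
    return arr[starts], lengths


def rle_decode(values, lengths):
    # Expand each value by its run length.
    return np.repeat(values, lengths)
```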
Currently, if you switch between using Zarr in the main thread (e.g., making Zarr API calls directly in an IPython session) and using it from a multi-threaded context (e.g., if you use a Zarr array as part of a Dask computation), you have to manually switch the way Blosc is used by calling zarr.blosc.use_context(True) and zarr.blosc.use_context(False). This is cumbersome for interactive analysis. It could be avoided if the Blosc extension checked whether the current thread is the main thread, and if so used Blosc in non-contextual mode, otherwise in contextual mode, so the user doesn't have to do any manual switching.
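A sketch of the check; note that threading.main_thread() requires Python 3.4+, so a Python 2 fallback would be needed:

```python
import threading


def in_main_thread():
    # True when called from the interpreter's main thread.
    return threading.current_thread() is threading.main_thread()


# Hypothetical integration point: choose the Blosc mode automatically.
# use_context(not in_main_thread())
```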
Give the user some feedback if they accidentally provide both compressor and compression_opts kwargs: either let one override the other and issue a warning, or raise a ValueError.
Support setting of key/value attributes on arrays, including persistent arrays.
Link to the HDFSMap in API docs when it becomes available.
Consider implementing quantize filter as per bcolz.
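A sketch modelled on bcolz's quantize, which rounds to a number of significant digits so that the zeroed low bits compress much better:

```python
import math

import numpy as np


def quantize(data, digits):
    # Choose a power-of-two scale covering the requested precision.
    precision = 10.0 ** -digits
    exp = int(math.floor(math.log10(precision)))
    bits = int(math.ceil(math.log(10.0 ** -exp, 2)))
    scale = 2.0 ** bits
    return np.around(scale * np.asarray(data)) / scale
```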
Currently there is a NoneCompressor class which provides no compression. This could be removed if the Array class explicitly handles the case where the compressor is None, resulting in some code simplification.
It would be useful to have a TempStore extending DirectoryStore but using a temporary directory.
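A minimal sketch, assuming DirectoryStore lives in zarr.storage:

```python
import atexit
import shutil
import tempfile

from zarr.storage import DirectoryStore


class TempStore(DirectoryStore):

    def __init__(self, suffix='', prefix='zarr', dir=None):
        # Create a private temporary directory and clean it up at exit.
        path = tempfile.mkdtemp(suffix=suffix, prefix=prefix, dir=dir)
        atexit.register(shutil.rmtree, path, True)  # ignore_errors=True
        super(TempStore, self).__init__(path)
```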
Factor out the common code for obtaining a buffer pointer, with PY2 compatibility for array.array.
It would be convenient, at least for testing, if the codec classes implemented __eq__.
As originally suggested here, add a zarr.hdf module which provides a hierarchical storage system for managing Zarr arrays.
I want something very similar to zarr on S3 and I'm pondering the easiest way to get there. One approach is to generalize zarr to accept pluggable byte storage solutions.
Currently, I believe that zarr effectively treats the file system as a MutableMapping into which it can deposit and retrieve bytes. If this is the case, then what are your thoughts on actually using the MutableMapping interface instead of touching files directly? That way I could provide MutableMappings that use file systems, zip files, S3, HDFS, etc. This nicely isolates a lot of the "where do I put this block of bytes" logic from the array slicing and compression logic.
For concreteness, here is a MutableMapping that loads/stores data in a directory on the file system: https://github.com/mrocklin/zict/blob/master/zict/file.py
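For reference, a stripped-down version of that idea; key validation and sub-directory handling are omitted:

```python
import os
from collections.abc import MutableMapping


class FileStore(MutableMapping):
    # Minimal MutableMapping keeping each value in its own file.

    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)

    def _path(self, key):
        return os.path.join(self.directory, key)

    def __getitem__(self, key):
        try:
            with open(self._path(key), 'rb') as f:
                return f.read()
        except FileNotFoundError:
            raise KeyError(key)

    def __setitem__(self, key, value):
        with open(self._path(key), 'wb') as f:
            f.write(value)

    def __delitem__(self, key):
        try:
            os.remove(self._path(key))
        except FileNotFoundError:
            raise KeyError(key)

    def __iter__(self):
        return iter(os.listdir(self.directory))

    def __len__(self):
        return len(os.listdir(self.directory))
```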
Upgrade c-blosc.