zarr-developers / zarr-python
An implementation of chunked, compressed, N-dimensional arrays for Python.
Home Page: http://zarr.readthedocs.io/
License: MIT License
Add support for persistent arrays.
Detect AVX2 support within setup.py and enable it when compiling.
The functions open_array and open_group could accept a store as an argument, providing mode semantics for opening any store. The open function could also be made more flexible, returning a group or an array depending on what is found.
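A sketch of the proposed dispatch logic, using toy stand-ins for open_array and open_group keyed off the standard .zarray/.zgroup metadata keys (open_any and the dict-based stores here are illustrative, not zarr's actual implementation):

```python
def open_array(store, mode="a"):
    # toy stand-in: an array store is marked by a ".zarray" metadata key
    if ".zarray" not in store:
        raise ValueError("store does not contain an array")
    return ("array", store)

def open_group(store, mode="a"):
    # toy stand-in: a group store is marked by a ".zgroup" metadata key
    if ".zgroup" not in store:
        raise ValueError("store does not contain a group")
    return ("group", store)

def open_any(store, mode="a"):
    # proposed flexible open(): return whichever object the store holds
    try:
        return open_array(store, mode=mode)
    except ValueError:
        return open_group(store, mode=mode)
```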
I want something very similar to zarr on S3 and I'm pondering the easiest way to get there. One approach is to generalize zarr to accept pluggable byte storage solutions.
Currently, I believe that zarr effectively treats the file system as a MutableMapping into which it can deposit and retrieve bytes. If this is the case, then what are your thoughts on actually using the MutableMapping interface instead of touching files directly? That way I could provide MutableMappings that use file systems, zip files, S3, HDFS, etc. This nicely isolates a lot of the "where do I put this block of bytes" logic from the array slicing and compression logic.
For concreteness, here is a MutableMapping that loads/stores data in a directory on the file system: https://github.com/mrocklin/zict/blob/master/zict/file.py
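The zict link above shows the idea; a minimal sketch of such a file-system-backed MutableMapping might look like this (class name and layout are illustrative):

```python
import os
from collections.abc import MutableMapping

class DirectoryMap(MutableMapping):
    """Minimal file-system-backed MutableMapping, in the spirit of
    zict.File: keys are file names, values are bytes."""

    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)

    def _path(self, key):
        return os.path.join(self.directory, key)

    def __getitem__(self, key):
        try:
            with open(self._path(key), "rb") as f:
                return f.read()
        except FileNotFoundError:
            raise KeyError(key)

    def __setitem__(self, key, value):
        with open(self._path(key), "wb") as f:
            f.write(value)

    def __delitem__(self, key):
        try:
            os.remove(self._path(key))
        except FileNotFoundError:
            raise KeyError(key)

    def __iter__(self):
        return iter(os.listdir(self.directory))

    def __len__(self):
        return len(os.listdir(self.directory))
```

A zip-file, S3, or HDFS mapping would implement the same five methods, leaving the array slicing and compression logic untouched.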
The codecs have been factored out into a new package https://github.com/alimanfoo/numcodecs. This means the zarr.codecs module could be removed and replaced by adding numcodecs as a runtime dependency.
Could be useful to add a path argument to the bare array creation functions. Open question then about whether ancestor groups should be created.
It would be useful to be able to create an array in a group with an option to overwrite any existing array with the given name if present.
Proposed to add an "overwrite=False" keyword argument to all creation methods on the Group class.
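A sketch of the proposed overwrite semantics, using a plain dict as a stand-in for a Group (create_array and the dict-based group are hypothetical, not zarr's API):

```python
def create_array(group, name, overwrite=False, **kwargs):
    """Proposed semantics: fail on a name collision unless overwrite=True."""
    if name in group:
        if not overwrite:
            raise ValueError(f"array {name!r} already exists")
        del group[name]
    group[name] = dict(kwargs)  # stand-in for actual array creation
    return group[name]
```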
Consider rewriting setup.py so that it tries to compile the blosc cython extension, but if that fails, continues on with a pure Python installation, similar to how simplejson installs.
Currently ZipStore performance is spectacularly poor because a ZipFile is re-opened on every __getitem__ call, causing the zip contents to be read multiple times. To have any reasonable performance we need to rewrite along the lines of zict, opening the ZipFile once and re-using it for each __getitem__ call. N.B., this will also mean implementing flush() and/or the context manager protocol for writing to a ZipStore.
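A minimal read-only sketch of the rewrite, opening the ZipFile once and adding the context manager protocol (class name is illustrative; write support and flush() are omitted):

```python
import zipfile
from collections.abc import Mapping

class CachedZipStore(Mapping):
    """Read-only zip-backed store that opens the ZipFile once and
    reuses it, instead of reopening on every __getitem__."""

    def __init__(self, path):
        self.zf = zipfile.ZipFile(path, mode="r")  # opened once

    def __getitem__(self, key):
        # ZipFile.open raises KeyError for a missing member
        with self.zf.open(key) as f:
            return f.read()

    def __iter__(self):
        return iter(self.zf.namelist())

    def __len__(self):
        return len(self.zf.namelist())

    def close(self):
        self.zf.close()

    # context manager protocol, as suggested above
    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.close()
```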
It would be possible to implement a simple run length codec, e.g., making use of https://gist.github.com/nvictus/66627b580c13068589957d6ab0919e66. This would very likely not offer any better compression than proper compressors, but might be interesting to try out.
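A minimal pure-NumPy sketch of such a run-length codec (function names are illustrative):

```python
import numpy as np

def rle_encode(arr):
    """Run-length encode a 1-D array into (values, run_lengths)."""
    arr = np.asarray(arr)
    if arr.size == 0:
        return arr, np.array([], dtype=np.int64)
    # indices where the value changes
    change = np.flatnonzero(arr[1:] != arr[:-1]) + 1
    starts = np.concatenate(([0], change))
    lengths = np.diff(np.concatenate((starts, [arr.size])))
    return arr[starts], lengths

def rle_decode(values, lengths):
    """Inverse of rle_encode."""
    return np.repeat(values, lengths)
```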
I have noticed performance increases in other projects when I choose default compression settings based on dtype.
Optimal compression settings depend strongly on bit patterns. Data types often strongly indicate bit pattern characteristics. For example integers often benefit more from compression than floats. Datetimes are often nearly sorted and so benefit more from shuffle.
It might improve performance to change the compression defaults in defaults.py to come from a function that takes the dtype as an input.
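A hypothetical sketch of such a function; the specific settings are illustrative heuristics following the observations above, not tuned defaults:

```python
import numpy as np

def default_compression_opts(dtype):
    """Pick Blosc-style defaults from the dtype (illustrative heuristics)."""
    dtype = np.dtype(dtype)
    if dtype.kind in "iu":  # integers: often compress well, push harder
        return dict(cname="lz4", clevel=9, shuffle=1)
    if dtype.kind == "M":   # datetimes: often nearly sorted, shuffle helps
        return dict(cname="lz4", clevel=5, shuffle=1)
    if dtype.kind == "f":   # floats: lighter level, bit-shuffle
        return dict(cname="lz4", clevel=3, shuffle=2)
    return dict(cname="lz4", clevel=5, shuffle=1)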
Would be useful to have a TempStore extending DirectoryStore but using a temporary directory.
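A minimal sketch of the idea, using tempfile and cleaning up at interpreter exit (only get/set are shown; a real implementation would extend DirectoryStore):

```python
import atexit
import os
import shutil
import tempfile

class TempStore:
    """Like a DirectoryStore, but backed by a temporary directory
    that is removed at interpreter exit."""

    def __init__(self, prefix="zarr"):
        self.path = tempfile.mkdtemp(prefix=prefix)
        atexit.register(shutil.rmtree, self.path, ignore_errors=True)

    def __setitem__(self, key, value):
        with open(os.path.join(self.path, key), "wb") as f:
            f.write(value)

    def __getitem__(self, key):
        try:
            with open(os.path.join(self.path, key), "rb") as f:
                return f.read()
        except FileNotFoundError:
            raise KeyError(key)
```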
Upgrade c-blosc to enable bitshuffle with single byte dtypes.
Add support for Python 2.7.
Useful to delete sub-group or array from group via del statement.
Consider implementing a scale-offset filter similar to HDF5, at least for floating point data.
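A minimal sketch of a scale-offset transform for floating-point data (function names and the int64 output type are illustrative; the filter is lossy, with precision limited by the scale):

```python
import numpy as np

def scaleoffset_encode(arr, scale, offset):
    """HDF5-style scale-offset sketch: shift by an offset, multiply by
    a scale, and round to integers, which compress far better."""
    return np.round((np.asarray(arr) - offset) * scale).astype(np.int64)

def scaleoffset_decode(enc, scale, offset):
    """Inverse transform, accurate to roughly 1/scale."""
    return enc.astype(np.float64) / scale + offset
```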
Add function to migrate array metadata from v1 to v2.
When creating an array via array(), if the user does not provide chunks, zarr checks the data for a chunks attribute, but does not handle the situation where data.chunks is None, and so generates an error when it tries to take len().
JSON is much more portable, and leaves open a possibility of writing zarr libraries in other programming languages.
Related: #5
Upgrade c-blosc.
Add a property with the total number of chunks for the array. This makes it easier to check whether all chunks have been initialized.
As originally suggested here, add a zarr.hdf module which provides a hierarchical storage system for managing Zarr arrays.
It would be convenient, at least for testing, if the codec classes implemented __eq__.
Consider implementing quantize filter as per bcolz.
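A sketch of the quantize transform, keeping roughly a given number of significant decimal digits so that trailing mantissa bits become zeros and compress better (this mirrors the bcolz-style approach, but the exact formula here is illustrative):

```python
import numpy as np

def quantize(arr, digits):
    """Lossy quantize filter: retain about `digits` decimal digits of
    precision by rounding to the nearest multiple of a power of two."""
    precision = 10.0 ** -digits
    exp = np.floor(np.log10(precision))
    bits = np.ceil(np.log2(10.0 ** -exp))
    scale = 2.0 ** bits
    return np.around(scale * np.asarray(arr)) / scale
```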
Apparently, this can result in significant IO savings when indexing multi-dimensional arrays.
See http://bl.ocks.org/jaredwinick/5073432 (note that the graphic is interactive)
Factor out common code to obtain buffer pointer with PY2 compatibility for array.array.
When a store does not implement listdir and keys need to be scanned, there is a possible optimization for listing members of a group: child arrays and groups could be discovered within the key scan, rather than requiring additional __contains__ tests.
It should be possible to support the use of F (Fortran) order to organise data within each chunk, as well as the current default C order. This may improve data compression in some cases, depending on the autocorrelation structure within an array.
It looks like the design is very similar to bcolz, but this would be nice to have as a point of reference.
Get source links working in docs.
It would be better for the fill_value to be zero rather than None in array creation functions such as array and create. With no fill value, chunks overhanging the edge of an array get filled with random memory, which may be very poorly compressible.
Give the user some feedback if they accidentally provide both compressor and compression_opts kwargs. Either have one override the other and issue a warning, or raise a ValueError.
Release actions:
Add an option to use blosc in the multi-threaded non-contextual (i.e., global blocking) mode, which is better when using zarr in a single-threaded environment because it allows blosc to use multiple threads internally.
Currently, if you switch between using Zarr in the main thread (e.g., making Zarr API calls directly in an IPython session) and using it from a multi-threaded context (e.g., if you use a Zarr array as part of a Dask computation), you have to manually switch the way Blosc is used by calling zarr.blosc.use_context(True) and zarr.blosc.use_context(False). This is cumbersome for interactive analysis. It could be avoided if the Blosc extension checked whether the current thread is the main thread, using Blosc in non-contextual mode if so and contextual mode otherwise, so the user doesn't have to do any manual switching.
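The main-thread check itself is straightforward with the standard threading module; a sketch of the decision logic (function name is hypothetical):

```python
import threading

def blosc_use_context_default():
    """Return the value that would be passed to zarr.blosc.use_context():
    False (non-contextual, internally multi-threaded Blosc) when running
    on the main thread, True (contextual mode) on any other thread."""
    return threading.current_thread() is not threading.main_thread()
```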
It would be possible to implement Zstd, LZ4 and Snappy codecs that make use of these compressors directly, not via Blosc. Also implementing a Zlib codec directly on C code rather than via Python stdlib would probably be faster. Source code for these is already present within the c-blosc submodule. Personally I would always go via Blosc so not strongly motivated to do this myself, but keeping this as placeholder.
Consider implementing delta filter.
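A minimal NumPy sketch of a delta filter, storing the first element followed by successive differences, which often helps on sorted or smooth data (function names are illustrative; integer overflow handling is omitted):

```python
import numpy as np

def delta_encode(arr):
    """Keep the first element; replace the rest with differences."""
    arr = np.asarray(arr)
    out = arr.copy()
    out[1:] = arr[1:] - arr[:-1]
    return out

def delta_decode(enc):
    """Inverse: a cumulative sum restores the original values."""
    return np.cumsum(enc)
```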
Could add codecs to check for data corruption. CRC32 and Adler32 could be implemented via zlib module from Python standard library. HDF5 uses Fletcher32, not sure where implementation could be available from.
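A sketch of a CRC32 checksum codec using only the standard library, appending the digest to the payload on encode and verifying on decode (the framing is illustrative):

```python
import struct
import zlib

def crc32_encode(buf):
    """Append a CRC32 checksum (4 little-endian bytes) to the payload,
    in the spirit of HDF5's Fletcher32 filter."""
    return buf + struct.pack("<I", zlib.crc32(buf) & 0xFFFFFFFF)

def crc32_decode(buf):
    """Verify and strip the trailing checksum; raise on corruption."""
    payload, stored = buf[:-4], struct.unpack("<I", buf[-4:])[0]
    if zlib.crc32(payload) & 0xFFFFFFFF != stored:
        raise ValueError("CRC32 checksum mismatch: data corrupted")
    return payload
```

An Adler32 variant would be identical with zlib.adler32 in place of zlib.crc32.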
Review scenarios where no compression is requested, either via compression=None or via Blosc with clevel=0. Are there any opportunities to avoid unnecessary memory copies?
Currently the blosc extension uses array.array for memory allocation and to minimize buffer copies. This is also possible using bytes, via PyBytes_FromStringAndSize(NULL, nbytes), PyBytes_AS_STRING and Py_SIZE; an example is the python-zstd extension.
Returning bytes would be marginally better for compatibility, e.g., the HDFS mapping implementation can only handle bytes, and so needs to copy an array to bytes if given an array.
Consider increasing the default Blosc configuration to allow Blosc to use up to 8 cores if available.
Support setting of key/value attributes on arrays, including persistent arrays.
Currently there is a NoneCompressor class which provides no compression. This could be removed if the Array class explicitly handles the case where the compressor is None, resulting in some code simplification.
Link to the HDFSMap in API docs when it becomes available.
Add support for appending to axis other than 0.
Make fasteners a conditional import.
Skip the zict tests if zict is not installed, to make the conda-forge setup easier.
...and any other documentation TODOs.