gbeckers / darr

A Python library for numpy arrays that persist on disk in a format that is simple, self-documented and tool-independent, and maximizes universal readability.

License: Other


darr's Introduction

Darr


Darr is a Python science library to work efficiently with potentially very large, disk-based Numpy arrays that are widely readable and self-documented. Every array has its own documentation that includes copy-paste ready code to read it in many popular data science languages, such as R, Julia, Scilab, IDL, Matlab, Maple, and Mathematica, or in Python/Numpy without Darr. Your numerical arrays can be read in other analysis environments with minimal effort and without any need for exporting/copying data.

In essence, Darr makes it trivially easy to share your numerical arrays and metadata with others or with yourself when working in different computing environments, and stores them in a future-proof way.

Universal readability of data is a pillar of good scientific practice. It is also generally a good idea for anyone who wants to save data for the longer term, who wants to flexibly move between analysis environments, or who wants to share data with others without spending much time on figuring out and/or explaining how the receiver can read it. Want to quickly try out an algorithm your colleague wrote in R or Matlab, but no idea how to read your 7-dimensional uint32 numpy array in those environments? A quick copy-paste of code from the documentation included with the array is all that is needed to read it (see example). No need to export anything. Want to share your array with non-Python colleagues? No looking up things, no need to make notes or to provide elaborate explanation. No dependence on complicated formats or specialized libraries.

More rationale for a tool-independent approach to numeric array storage is provided here.

Under the hood, Darr uses NumPy memory-mapped arrays, a widely established and trusted way of working with disk-based numerical data, which makes Darr fully NumPy compatible. This enables efficient out-of-core read/write access to potentially very large arrays. In addition to automatic documentation, Darr adds other functionality to NumPy's memmap, such as easy appending and truncating of data, support for ragged arrays, the ability to create arrays from iterators, and easy use of metadata. When you change the size of your array, its documentation is automatically kept up to date. Flat binary files and (JSON) text files are accompanied by a README text file that explains how the array and metadata are stored (see example arrays).
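The mechanism described above can be illustrated with plain NumPy. The following is a minimal sketch and not Darr's own implementation: the file names `values.bin` and `description.json` and the layout of the JSON description are assumptions made for this example only. It writes a flat binary file, appends rows by adding raw bytes to the end of the file, and then accesses the result through `numpy.memmap`.

```python
import json
import tempfile
from pathlib import Path

import numpy as np

# Hypothetical directory and file names, chosen for this sketch only.
arraydir = Path(tempfile.mkdtemp()) / "example_array"
arraydir.mkdir()
binpath = arraydir / "values.bin"
descpath = arraydir / "description.json"

# Store a small array as a flat binary file plus a JSON description.
data = np.arange(12, dtype="<i4").reshape(4, 3)
data.tofile(binpath)
descpath.write_text(json.dumps({"dtype": "<i4", "shape": [4, 3], "order": "C"}))

# Append two rows by writing raw bytes to the end of the file, then
# update the description so that it matches the new array size.
extra = np.array([[12, 13, 14], [15, 16, 17]], dtype="<i4")
with open(binpath, "ab") as f:
    f.write(extra.tobytes())
desc = json.loads(descpath.read_text())
desc["shape"][0] += extra.shape[0]
descpath.write_text(json.dumps(desc))

# Memory-map the file: only the parts that are indexed are read from disk.
mm = np.memmap(binpath, dtype=desc["dtype"], mode="r+",
               shape=tuple(desc["shape"]), order=desc["order"])
mm[0, 0] = -1          # assignment is written through to the file
mm.flush()
last_row = mm[-1].tolist()
```

Appending is cheap with this layout because a C-ordered array grows along its first axis simply by extending the file. Only the small text description needs to be rewritten.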

See this tutorial for a brief introduction, or the documentation for more info.

Darr is currently pre-1.0, and still undergoing development. It is open source and freely available under the New BSD License terms.

Features

  • Data is stored purely based on flat binary and text files, maximizing universal readability.
  • Automatic self-documentation, including copy-paste ready code snippets for reading the array in a number of popular data analysis environments, such as Python (without Darr), R, Julia, Scilab, Octave/Matlab, GDL/IDL, and Mathematica (see example array).
  • Disk-persistent array data is directly accessible through NumPy indexing and may be larger than RAM.
  • Easy and efficient appending of data (see example).
  • Supports ragged arrays.
  • Easy use of metadata, stored in a widely readable separate JSON text file (see example).
  • Many numeric types are supported: (u)int8-(u)int64, float16-float64, complex64, complex128.
  • Integrates easily with the Dask library for out-of-core computation on very large arrays.
  • Minimal dependencies, only NumPy.
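To illustrate why flat binary plus JSON text maximizes readability, here is a hedged sketch that reads such data using only the Python standard library, without Darr or NumPy. The file names, the metadata keys, and the helper `read_flat_float64` are hypothetical and exist only for this example. The array is assumed to be little-endian float64 in row-major order.

```python
import array
import json
import sys
import tempfile
from pathlib import Path

folder = Path(tempfile.mkdtemp())

# Write six float64 values (a 2 x 3 array in row-major order) as raw
# little-endian bytes, and some metadata as a separate JSON text file.
values = array.array("d", [0.5, 1.5, 2.5, 3.5, 4.5, 5.5])
if sys.byteorder == "big":
    values.byteswap()  # make sure the file is little-endian
with open(folder / "values.bin", "wb") as f:
    values.tofile(f)
(folder / "metadata.json").write_text(
    json.dumps({"samplingrate": 1000.0, "unit": "mV"}))


def read_flat_float64(path, shape):
    """Read a little-endian float64 file into a list of rows."""
    nrows, ncols = shape
    flat = array.array("d")
    with open(path, "rb") as f:
        flat.fromfile(f, nrows * ncols)
    if sys.byteorder == "big":
        flat.byteswap()  # convert from little-endian to native order
    return [list(flat[r * ncols:(r + 1) * ncols]) for r in range(nrows)]


rows = read_flat_float64(folder / "values.bin", (2, 3))
metadata = json.loads((folder / "metadata.json").read_text())
```

All the reader needs to know is the numeric type, the byte order, the shape, and the memory order, which is the kind of information a README stored with the array can state in a few lines.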

Limitations

  • No structured (record) arrays supported yet, just ndarrays.
  • No string data, just numeric.
  • No compression, although compression for archiving purposes is supported.
  • Uses multiple files per array, as binary data is separated from text documentation and metadata. This can be a disadvantage in terms of storage space if you have very many very small arrays.

Installation

Darr officially depends on Python 3.9 or higher. Older versions may work (probably >= 3.6) but are not tested.

Install Darr from PyPI:

$ pip install darr

Or, install Darr via conda:

$ conda install -c conda-forge darr

To install the latest development version, use pip with the latest GitHub master:

$ pip install git+https://github.com/gbeckers/darr@master

Documentation

See the documentation for more information.

Contributing

Any help / suggestions / ideas / contributions are welcome and very much appreciated. For any comment, question, or error, please open an issue or propose a pull request.

Other interesting projects

If Darr is not exactly what you are looking for, have a look at these projects:

Darr is BSD licensed (BSD 3-Clause License). (c) 2017-2023, Gabriël Beckers



darr's Issues

implement reading code for types that are not directly supported

Not all languages can directly read all numeric types. Sometimes this is quite inconvenient, for example when working with float16 data, which Matlab cannot read as such, or with complex types. Matlab has these types, but cannot read them directly. Darr should produce read code that works around the problem in those cases.
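One possible workaround, sketched here in Python purely as an illustration, is to read float16 data as unsigned 16-bit integers and decode the IEEE 754 half-precision bit pattern by hand. This is not Darr's generated read code. The function name `half_bits_to_float` is hypothetical, and a reader for another language would have to express the same arithmetic in that language.

```python
import struct


def half_bits_to_float(bits: int) -> float:
    """Decode one IEEE 754 binary16 value given as a 16-bit integer."""
    sign = -1.0 if (bits >> 15) & 1 else 1.0
    exponent = (bits >> 10) & 0x1F   # 5 exponent bits
    fraction = bits & 0x3FF          # 10 fraction bits
    if exponent == 0:
        # Zero and subnormal numbers: no implicit leading one.
        return sign * fraction * 2.0 ** -24
    if exponent == 0x1F:
        # All exponent bits set: infinity or NaN.
        return sign * float("inf") if fraction == 0 else float("nan")
    # Normal numbers: implicit leading one, exponent bias of 15.
    return sign * (1.0 + fraction / 1024.0) * 2.0 ** (exponent - 15)


# Four half-precision values as they would appear in a little-endian file.
raw = struct.pack("<4e", 1.0, -2.5, 0.333251953125, 65504.0)

# Read them as uint16 and decode manually.
bits = struct.unpack("<4H", raw)
decoded = [half_bits_to_float(b) for b in bits]

# Reference result from the struct module's native half-precision support.
reference = list(struct.unpack("<4e", raw))
```

Every float16 value is exactly representable in float32 and float64, so decoding into a wider type loses no information.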

implement memmap code where possible

Investigate and implement if possible read code in other languages (at least R and hopefully Matlab and Julia) based on a memory mapped file.

improve README of RaggedArrays

The README is now unnecessarily complex if the subarrays are 1-dimensional. Also, specific information on the shape of the subarrays could be included right at the top, to make things less abstract.

write specification

Although it would be short, it would be good to have a description in the documentation of what the specification of a Darr Array (and RaggedArray) is.
