SparseHDF5

Lifeboat, LLC was awarded a DOE SBIR Phase I grant to design extensions to the HDF5 software to support efficient access to and storage of sparse data. This repository contains the design documents for the HDF5 File Format extensions and public APIs to support the feature.

Sparse data is common in many scientific disciplines and experiments. Several examples are discussed in “Sparse Data Management in HDF5” [1], including High Energy Physics, Neutron and X-ray Scattering, Mass Spectrometry and Compressive Sensing experiments. In those use cases, only 0.1% to 10% of gathered data is of interest. HDF5, due to its proven track record and flexibility, remains the data format of choice. As the amount of data produced continues to grow due to higher instrument and detector resolution and higher sampling rates, there is a clear demand for efficient management of sparse data in HDF5. Adding support for sparse data will significantly simplify data processing software and widen adoption of HDF5.

In HDF5, problem-sized data is stored in multidimensional arrays of elements of a given type. Currently, the HDF5 library requires that all elements be defined with user-supplied values or fill values, and it treats data as “dense”, mapping each data element to storage during I/O operations. Features such as HDF5 chunking and per-dataset compression help to optimize the storage of sparse data by not storing chunks devoid of user-supplied values and by compressing each chunk written. However, there are several obvious disadvantages to applying “dense storage” thinking to sparse data. Each chunk written may still contain a large amount of blank data, and the location of the actual user-supplied data is not explicitly represented. In addition, a sparse dataset that is stored and accessed as a dense dataset may occupy a huge amount of memory once it has been read and decompressed. Therefore, a different approach to handling HDF5 sparse data in files and in memory is needed.
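The memory-footprint concern can be made concrete with a rough calculation. The sketch below is illustrative only: the dataset shape, element size, and defined fraction are hypothetical, and it assumes a naive encoding that stores one full coordinate tuple per defined value, which is not the encoding proposed in the design documents.

```python
from math import prod


def dense_bytes(shape, elem_size):
    """Bytes needed when every element of the extent is mapped to storage."""
    return prod(shape) * elem_size


def sparse_bytes(shape, elem_size, fraction_defined, coord_size=8):
    """Bytes for the defined values plus one coordinate tuple per value.

    Assumption: each coordinate is stored as len(shape) integers of
    coord_size bytes; a real selection encoding could be more compact.
    """
    n_defined = int(prod(shape) * fraction_defined)
    return n_defined * (elem_size + len(shape) * coord_size)


# Hypothetical detector: 4096 x 4096 pixels, 100 frames, 32-bit values.
shape = (4096, 4096, 100)
elem_size = 4

dense = dense_bytes(shape, elem_size)
sparse = sparse_bytes(shape, elem_size, fraction_defined=0.001)

print(f"dense:  {dense / 2**30:.2f} GiB")
print(f"sparse: {sparse / 2**20:.2f} MiB")
```

With 0.1% of the elements defined, the dense array needs 6.25 GiB in memory while the naive sparse form needs roughly 45 MiB. The advantage narrows as the defined fraction grows, because every value carries its coordinates with it.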

As prototyped in [1], the proposed approach to sparse data management uses the existing HDF5 selection mechanism to represent sparse datasets, both in memory and in files. Since it will be impractical to hold entire sparse datasets in memory, we break the extent of the sparse dataset into user-specified, regular, n-dimensional hyper-rectangles. A sparse chunk is a hyper-rectangle endowed with an HDF5 selection, which represents all defined entries in its domain. This way, each sparse chunk has a selection (data coordinates) and associated user-defined data. This approach allows us to store only data of interest and to simultaneously operate on several sparse chunks using existing HDF5 facilities for serialization and deserialization, and for constructing partial I/O on sparse data.
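The idea of a sparse chunk can be sketched in a few lines of plain Python. This is a conceptual model only, not the proposed HDF5 API: the function names, the dictionary layout, and the use of explicit coordinate lists to stand in for an HDF5 selection are all assumptions made for illustration.

```python
from collections import defaultdict


def split_coord(coord, chunk_shape):
    """Return (chunk index, chunk-local coordinate) for a global coordinate."""
    chunk_index = tuple(c // s for c, s in zip(coord, chunk_shape))
    local = tuple(c % s for c, s in zip(coord, chunk_shape))
    return chunk_index, local


def partition_into_sparse_chunks(entries, chunk_shape):
    """Group defined entries by the hyper-rectangle that contains them.

    entries maps global coordinate tuples to values. Each resulting chunk
    holds a "selection" (chunk-local coordinates of the defined entries)
    and the matching "data" values, in the same order.
    """
    chunks = defaultdict(lambda: {"selection": [], "data": []})
    for coord in sorted(entries):
        chunk_index, local = split_coord(coord, chunk_shape)
        chunks[chunk_index]["selection"].append(local)
        chunks[chunk_index]["data"].append(entries[coord])
    return dict(chunks)


def read_element(chunks, chunk_shape, coord, fill_value=0.0):
    """Look up one element; undefined locations yield the fill value."""
    chunk_index, local = split_coord(coord, chunk_shape)
    chunk = chunks.get(chunk_index)
    if chunk is None or local not in chunk["selection"]:
        return fill_value
    return chunk["data"][chunk["selection"].index(local)]


# Three defined entries in a 12 x 12 extent split into 4 x 4 chunks.
entries = {(0, 1): 1.5, (3, 7): 2.5, (9, 9): 4.0}
chunk_shape = (4, 4)
chunks = partition_into_sparse_chunks(entries, chunk_shape)

print(sorted(chunks))                                # chunks that hold data
print(read_element(chunks, chunk_shape, (3, 7)))     # a defined entry
print(read_element(chunks, chunk_shape, (5, 5)))     # an undefined entry
```

Only the chunks that contain defined entries exist in the result, and each one can be processed independently of the others, which is the property the approach relies on for partial I/O.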

Our proposed implementation offers sparse array storage that is independent of the in-memory representation of the sparse data, thus making sparse data portable between applications. It also requires minimal changes to application code.

While the immediate need is for file format and API changes to support sparse data, we have given significant thought to a number of other potential HDF5 enhancements and noticed commonalities with the sparse data problem. In particular, the idea of extending the concept of the chunk to contain multiple sections that describe different facets of the values stored in the chunk seems to be applicable to a number of problems, for example, to HDF5 variable-length data and non-homogeneous arrays. This in turn raises the problem of compressing these different sections efficiently, as different compression algorithms may be optimal for different sections. Please see the documents in the design_docs directory for more details. We will be very happy to receive community feedback on the proposed designs.
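The per-section compression question can be illustrated with a small experiment. The sketch below is an assumption-laden toy, not the file format described in the design documents: the two section names, their byte layouts, and the choice of the zlib, bz2, and lzma codecs from the Python standard library are all hypothetical. It compresses each section of a chunk with every codec and keeps whichever result is smallest.

```python
import bz2
import lzma
import struct
import zlib

# Candidate codecs: name -> (compress, decompress). Illustrative choice only.
CODECS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def compress_section(raw):
    """Try every codec on one section and keep the smallest payload."""
    candidates = ((name, comp(raw)) for name, (comp, _) in CODECS.items())
    return min(candidates, key=lambda item: len(item[1]))


def decompress_section(codec, payload):
    """Invert compress_section using the recorded codec name."""
    return CODECS[codec][1](payload)


# A hypothetical chunk with two sections: linearized coordinates of the
# defined entries, and the values stored at those coordinates.
coords = list(range(0, 4000, 4))
values = [float(i % 7) for i in range(len(coords))]
sections = {
    "selection": struct.pack(f"<{len(coords)}Q", *coords),
    "data": struct.pack(f"<{len(values)}d", *values),
}

# Each section records which codec was chosen alongside its payload.
encoded = {name: compress_section(raw) for name, raw in sections.items()}

for name, (codec, payload) in encoded.items():
    restored = decompress_section(codec, payload)
    print(name, codec, len(sections[name]), "->", len(payload),
          "round trip ok:", restored == sections[name])
```

Recording the codec name next to each payload is what lets the sections be decoded independently. Which codec wins for which section depends on the data, so no particular outcome should be read into this example.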

In Phase II (if awarded), we plan to implement the new feature and integrate the solution into the open-source HDF5 library.

References:

  1. J. Mainzer, N. Fortner, G. Heber, and others, “Sparse Data Management in HDF5”, 2019 IEEE/ACM 1st Annual Workshop on Large-scale Experiment-in-the-Loop Computing (XLOOP), November 2019, http://dx.doi.org/10.1109/XLOOP49562.2019.00009
