
ASDF (Advanced Scientific Data Format) is a next generation interchange format for scientific data

Home Page: http://asdf.readthedocs.io/

License: BSD 3-Clause "New" or "Revised" License

Topics: asdf, advanced-scientific-data-format, astronomy, jwst, astropy


ASDF - Advanced Scientific Data Format


The Advanced Scientific Data Format (ASDF) is a next-generation interchange format for scientific data. This package contains the Python implementation of the ASDF Standard. More information on the ASDF Standard itself can be found here.

The ASDF format has the following features:

  • A hierarchical, human-readable metadata format (implemented using YAML)
  • Numerical arrays stored as binary data blocks that can be memory mapped; data blocks can optionally be compressed
  • Automatic validation of the data structure using schemas (implemented using JSON Schema)
  • Automatic serialization of native Python data types (numerical types, strings, dicts, lists)
  • Extensibility to serialize custom data types

ASDF is under active development on GitHub. More information on contributing can be found below.

Overview

This section outlines basic use cases of the ASDF package for creating and reading ASDF files.

Creating a file

We're going to store several numpy arrays and other data in an ASDF file. We do this by creating a "tree", which is simply a dict, and passing it to the constructor of `AsdfFile`:

import asdf
import numpy as np

# Create some data
sequence = np.arange(100)
squares = sequence**2
random = np.random.random(100)

# Store the data in an arbitrarily nested dictionary
tree = {
    "foo": 42,
    "name": "Monty",
    "sequence": sequence,
    "powers": {"squares": squares},
    "random": random,
}

# Create the ASDF file object from our data tree
af = asdf.AsdfFile(tree)

# Write the data to a new file
af.write_to("example.asdf")

If we open the newly created file's metadata section, we can see some of the key features of ASDF on display:

#ASDF 1.0.0
#ASDF_STANDARD 1.2.0
%YAML 1.1
%TAG ! tag:stsci.edu:asdf/
--- !core/asdf-1.1.0
asdf_library: !core/software-1.0.0 {author: The ASDF Developers, homepage: 'http://github.com/asdf-format/asdf',
  name: asdf, version: 2.0.0}
history:
  extensions:
  - !core/extension_metadata-1.0.0
    extension_class: asdf.extension.BuiltinExtension
    software: {name: asdf, version: 2.0.0}
foo: 42
name: Monty
powers:
  squares: !core/ndarray-1.0.0
    source: 1
    datatype: int64
    byteorder: little
    shape: [100]
random: !core/ndarray-1.0.0
  source: 2
  datatype: float64
  byteorder: little
  shape: [100]
sequence: !core/ndarray-1.0.0
  source: 0
  datatype: int64
  byteorder: little
  shape: [100]
...

The metadata in the file mirrors the structure of the tree that was stored. It is hierarchical and human-readable. Notice that metadata has been added to the tree that was not explicitly given by the user. Notice also that the numerical array data is not stored in the metadata tree itself. Instead, it is stored as binary data blocks below the metadata section (not shown above).

It is possible to compress the array data when writing the file:

af.write_to("compressed.asdf", all_array_compression="zlib")

The built-in compression algorithms are 'zlib' and 'bzp2'. The 'lz4' algorithm becomes available when the lz4 package is installed. Other compression algorithms may be available via extensions.
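As a rough illustration of why compression helps some data much more than others, here is a sketch using the standard-library zlib module directly (this demonstrates the general trade-off, not asdf's block machinery):

```python
import os
import zlib

# Illustration only: ASDF compresses each binary block with the chosen
# codec. Structured data compresses well; random data barely at all.
structured = bytes(range(256)) * 256   # 64 KiB repeating pattern
random_ish = os.urandom(256 * 256)     # 64 KiB of random bytes

print(len(zlib.compress(structured)) < 5000)    # → True (huge savings)
print(len(zlib.compress(random_ish)) > 60000)   # → True (almost none)
```

This is why compression is opt-in: for already high-entropy data (e.g. noise-dominated floating-point arrays) it mostly adds CPU cost.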

Reading a file

To read an existing ASDF file, we simply use the top-level open function of the asdf package:

import asdf

af = asdf.open("example.asdf")

The open function also works as a context manager:

with asdf.open("example.asdf") as af:
    ...

To get a quick overview of the data stored in the file, use the AsdfFile.info() method:

>>> import asdf
>>> af = asdf.open("example.asdf")
>>> af.info()
root (AsdfObject)
├─asdf_library (Software)
│ ├─author (str): The ASDF Developers
│ ├─homepage (str): http://github.com/asdf-format/asdf
│ ├─name (str): asdf
│ └─version (str): 2.8.0
├─history (dict)
│ └─extensions (list)
│   └─[0] (ExtensionMetadata)
│     ├─extension_class (str): asdf.extension.BuiltinExtension
│     └─software (Software)
│       ├─name (str): asdf
│       └─version (str): 2.8.0
├─foo (int): 42
├─name (str): Monty
├─powers (dict)
│ └─squares (NDArrayType): shape=(100,), dtype=int64
├─random (NDArrayType): shape=(100,), dtype=float64
└─sequence (NDArrayType): shape=(100,), dtype=int64

The AsdfFile behaves like a Python dict, and nodes are accessed like any other dictionary entry:

>>> af["name"]
'Monty'
>>> af["powers"]
{'squares': <array (unloaded) shape: [100] dtype: int64>}

Array data remains unloaded until it is explicitly accessed:

>>> af["powers"]["squares"]
array([   0,    1,    4,    9,   16,   25,   36,   49,   64,   81,  100,
        121,  144,  169,  196,  225,  256,  289,  324,  361,  400,  441,
        484,  529,  576,  625,  676,  729,  784,  841,  900,  961, 1024,
       1089, 1156, 1225, 1296, 1369, 1444, 1521, 1600, 1681, 1764, 1849,
       1936, 2025, 2116, 2209, 2304, 2401, 2500, 2601, 2704, 2809, 2916,
       3025, 3136, 3249, 3364, 3481, 3600, 3721, 3844, 3969, 4096, 4225,
       4356, 4489, 4624, 4761, 4900, 5041, 5184, 5329, 5476, 5625, 5776,
       5929, 6084, 6241, 6400, 6561, 6724, 6889, 7056, 7225, 7396, 7569,
       7744, 7921, 8100, 8281, 8464, 8649, 8836, 9025, 9216, 9409, 9604,
       9801])

>>> import numpy as np
>>> expected = [x**2 for x in range(100)]
>>> np.equal(af["powers"]["squares"], expected).all()
True

By default, uncompressed data blocks are memory mapped for efficient access. Memory mapping can be disabled by using the copy_arrays option of open when reading:

af = asdf.open("example.asdf", copy_arrays=True)
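To see what memory mapping buys, here is a minimal sketch using Python's standard mmap module. This illustrates the mechanism (file bytes exposed as a buffer, paged in by the OS on access), not asdf's internal implementation:

```python
import mmap
import os
import tempfile

# Illustration only: a memory map exposes file bytes as a buffer without
# reading the whole file up front; the OS pages data in on access.
fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"\x00" * (1024 * 1024))  # a 1 MiB file of zeros
    with mmap.mmap(fd, 0, access=mmap.ACCESS_READ) as mm:
        # Only the pages actually touched are read from disk.
        print(mm[123456])  # → 0
        print(len(mm))     # → 1048576
finally:
    os.close(fd)
    os.unlink(path)
```

With copy_arrays=True, asdf instead reads block data into ordinary in-memory arrays, which avoids holding the file open but costs memory up front.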

For more information and for advanced usage examples, see the documentation.

Extending ASDF

Out of the box, the asdf package automatically serializes and deserializes native Python types. It is possible to extend asdf by implementing custom tags that correspond to custom user types. More information on extending ASDF can be found in the official documentation.
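As a sketch of what a custom tag involves, the class below has the general shape of an asdf converter: one method decomposes the custom type into YAML-serializable primitives, the other rebuilds it. The Point2D type, the tag URI, and both class names are hypothetical, and the registration step (hooking the converter into asdf via an extension) is omitted; see the official documentation for the real API.

```python
class Point2D:
    """A hypothetical user type we want to serialize."""
    def __init__(self, x, y):
        self.x, self.y = x, y


class Point2DConverter:
    # The tag ties YAML nodes to this converter (URI is made up).
    tags = ["asdf://example.com/tags/point2d-1.0.0"]
    types = [Point2D]

    def to_yaml_tree(self, obj, tag, ctx):
        # Decompose the object into YAML-serializable primitives.
        return {"x": obj.x, "y": obj.y}

    def from_yaml_tree(self, node, tag, ctx):
        # Rebuild the object from the stored mapping.
        return Point2D(node["x"], node["y"])


# Round-trip through the converter's two halves:
conv = Point2DConverter()
node = conv.to_yaml_tree(Point2D(1.0, 2.0), conv.tags[0], None)
restored = conv.from_yaml_tree(node, conv.tags[0], None)
print(node)                    # → {'x': 1.0, 'y': 2.0}
print(restored.x, restored.y)  # → 1.0 2.0
```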

Installation

Stable releases of the ASDF Python package are registered on PyPI. The latest stable version can be installed using pip:

$ pip install asdf

The latest development version of ASDF is available from the main branch on GitHub. To clone the project:

$ git clone https://github.com/asdf-format/asdf

To install:

$ cd asdf
$ pip install .

To install in development mode:

$ pip install -e .

Testing

To install the test dependencies from a source checkout of the repository:

$ pip install -e ".[tests]"

To run the unit tests from a source checkout of the repository:

$ pytest

It is also possible to run the test suite from an installed version of the package:

$ pip install "asdf[tests]"
$ pytest --pyargs asdf

It is also possible to run the tests using tox:

$ pip install tox

To list all available environments:

$ tox -va

To run a specific environment:

$ tox -e <envname>

Documentation

More detailed documentation on this software package can be found here.

More information on the ASDF Standard itself can be found here.

There are two mailing lists for ASDF:

License

ASDF is licensed under a BSD 3-clause style license. See LICENSE.rst and the licenses folder for licenses of any included software.

Contributing

We welcome feedback and contributions to the project. Contributions of code, documentation, or general feedback are all appreciated. Please follow the contributing guidelines to submit an issue or a pull request.

We strive to provide a welcoming community to all of our users by abiding by the Code of Conduct.

asdf's People

Contributors

astrofrog, bernie-simon, bnavigator, braingram, bsipocz, cadair, cagtayfabry, cdeil, dependabot[bot], drdavella, embray, eslavich, eteq, jdavies-st, kbarbary, keflavich, kmacdonald-stsci, larrybradley, lgarrison, mdboom, mwcraig, nden, olebole, perrygreenfield, pllim, pre-commit-ci[bot], vmarkovtsev, williamjamieson, wkerzendorf, zacharyburnett


asdf's Issues

docs building error with sphinx 1.3.1

Building the HTML documentation with sphinx 1.3.1 and the latest astropy-helpers fails with the following error:

# Sphinx version: 1.3.1
# Python version: 3.4.3 (CPython)
# Docutils version: 0.12 release
# Jinja2 version: 2.7.3
# Last messages:
#   11 added, 0 changed, 0 removed
#   reading sources... [  9%] api/pyasdf.AsdfExtension
#   reading sources... [ 18%] api/pyasdf.AsdfFile
#   reading sources... [ 27%] api/pyasdf.AsdfType
#   reading sources... [ 36%] api/pyasdf.Stream
#   reading sources... [ 45%] api/pyasdf.ValidationError
#   reading sources... [ 54%] api/pyasdf.fits_embed.AsdfInFits
#   reading sources... [ 63%] api/pyasdf.open
#   reading sources... [ 72%] api/pyasdf.test
#   reading sources... [ 81%] index
# Loaded extensions:
#   sphinx.ext.autosummary (1.3.1) from /usr/lib/python3.4/site-packages/sphinx/ext/autosummary/__init__.py
#   sphinx.ext.pngmath (1.3.1) from /usr/lib/python3.4/site-packages/sphinx/ext/pngmath.py
#   astropy_helpers.sphinx.ext.changelog_links (unknown version) from /home/mdevalbo/.local/lib/python3.4/site-packages/astropy_helpers-1.1.dev427-py3.4.egg/astropy_helpers/sphinx/ext/changelog_links.py
#   sphinx.ext.autodoc (1.3.1) from /usr/lib/python3.4/site-packages/sphinx/ext/autodoc.py
#   sphinx.ext.coverage (1.3.1) from /usr/lib/python3.4/site-packages/sphinx/ext/coverage.py
#   sphinx.ext.inheritance_diagram (1.3.1) from /usr/lib/python3.4/site-packages/sphinx/ext/inheritance_diagram.py
#   astropy_helpers.sphinx.ext.doctest (unknown version) from /home/mdevalbo/.local/lib/python3.4/site-packages/astropy_helpers-1.1.dev427-py3.4.egg/astropy_helpers/sphinx/ext/doctest.py
#   astropy_helpers.sphinx.ext.autodoc_enhancements (unknown version) from /home/mdevalbo/.local/lib/python3.4/site-packages/astropy_helpers-1.1.dev427-py3.4.egg/astropy_helpers/sphinx/ext/autodoc_enhancements.py
#   example (unknown version) from sphinxext/example.py
#   astropy_helpers.sphinx.ext.viewcode (unknown version) from /home/mdevalbo/.local/lib/python3.4/site-packages/astropy_helpers-1.1.dev427-py3.4.egg/astropy_helpers/sphinx/ext/viewcode.py
#   astropy_helpers.sphinx.ext.automodsumm (unknown version) from /home/mdevalbo/.local/lib/python3.4/site-packages/astropy_helpers-1.1.dev427-py3.4.egg/astropy_helpers/sphinx/ext/automodsumm.py
#   astropy_helpers.sphinx.ext.numpydoc (unknown version) from /home/mdevalbo/.local/lib/python3.4/site-packages/astropy_helpers-1.1.dev427-py3.4.egg/astropy_helpers/sphinx/ext/numpydoc.py
#   astropy_helpers.sphinx.ext.astropyautosummary (unknown version) from /home/mdevalbo/.local/lib/python3.4/site-packages/astropy_helpers-1.1.dev427-py3.4.egg/astropy_helpers/sphinx/ext/astropyautosummary.py
#   sphinx.ext.graphviz (1.3.1) from /usr/lib/python3.4/site-packages/sphinx/ext/graphviz.py
#   matplotlib.sphinxext.plot_directive (unknown version) from /usr/lib/python3.4/site-packages/matplotlib/sphinxext/plot_directive.py
#   astropy_helpers.sphinx.ext.smart_resolver (unknown version) from /home/mdevalbo/.local/lib/python3.4/site-packages/astropy_helpers-1.1.dev427-py3.4.egg/astropy_helpers/sphinx/ext/smart_resolver.py
#   astropy_helpers.sphinx.ext.tocdepthfix (unknown version) from /home/mdevalbo/.local/lib/python3.4/site-packages/astropy_helpers-1.1.dev427-py3.4.egg/astropy_helpers/sphinx/ext/tocdepthfix.py
#   astropy_helpers.sphinx.ext.automodapi (unknown version) from /home/mdevalbo/.local/lib/python3.4/site-packages/astropy_helpers-1.1.dev427-py3.4.egg/astropy_helpers/sphinx/ext/automodapi.py
#   sphinx.ext.todo (1.3.1) from /usr/lib/python3.4/site-packages/sphinx/ext/todo.py
#   alabaster (0.7.4) from /usr/lib/python3.4/site-packages/alabaster/__init__.py
#   sphinx.ext.intersphinx (1.3.1) from /usr/lib/python3.4/site-packages/sphinx/ext/intersphinx.py
Traceback (most recent call last):
  File "/usr/lib/python3.4/site-packages/sphinx/cmdline.py", line 245, in main
    app.build(opts.force_all, filenames)
  File "/usr/lib/python3.4/site-packages/sphinx/application.py", line 264, in build
    self.builder.build_update()
  File "/usr/lib/python3.4/site-packages/sphinx/builders/__init__.py", line 245, in build_update
    'out of date' % len(to_build))
  File "/usr/lib/python3.4/site-packages/sphinx/builders/__init__.py", line 259, in build
    self.doctreedir, self.app))
  File "/usr/lib/python3.4/site-packages/sphinx/environment.py", line 618, in update
    self._read_serial(docnames, app)
  File "/usr/lib/python3.4/site-packages/sphinx/environment.py", line 638, in _read_serial
    self.read_doc(docname, app)
  File "/usr/lib/python3.4/site-packages/sphinx/environment.py", line 791, in read_doc
    pub.publish()
  File "/usr/lib/python3.4/site-packages/docutils/core.py", line 218, in publish
    self.apply_transforms()
  File "/usr/lib/python3.4/site-packages/docutils/core.py", line 199, in apply_transforms
    self.document.transformer.apply_transforms()
  File "/usr/lib/python3.4/site-packages/docutils/transforms/__init__.py", line 171, in apply_transforms
    transform.apply(**kwargs)
  File "/usr/lib/python3.4/site-packages/sphinx/transforms.py", line 129, in apply
    if has_child(node.parent, nodes.caption):
  File "/usr/lib/python3.4/site-packages/sphinx/transforms.py", line 116, in has_child
    return any(isinstance(child, cls) for child in node)
TypeError: 'NoneType' object is not iterable

YAML formatting issues

This is a placeholder to remind to deal with some issues with how the YAML is output.

Since YAML has multiple ways to represent the same thing, there are cases where it might be preferable to use one form over another. Currently, pyasdf does "whatever PyYAML does by default".

There are (at least) three separate things to consider here:

  • When reading an input file, preserving the form of each of the input entries when writing back out
  • When generating a file from scratch, using hints in the schema to select an output form
  • Allowing the user to explicitly specify the form of the output on an individual item basis

Inconsistent naming of read/write_to?

At the moment, the methods for reading and writing are read and write_to. These seem inconsistent, and I wonder if it should be either read_from/write_to or read/write? (the latter would be my preferred choice).

Failed test in transform schema

One of the schema tests fails with a KeyError. This is the end of the traceback output from python setup.py test:

cls = <class 'pyasdf.tags.transform.projections.Rotate3DType'>
node = {'phi': 12.3, 'psi': -1.2, 'theta': 34}
ctx = <pyasdf.asdf.AsdfFile object at 0x7f7e2f75d390>

    @classmethod
    def from_tree_transform(cls, node, ctx):
        print(node)
>       if node['direction'] == 'native2celestial':
E       KeyError: 'direction'

pyasdf/tags/transform/projections.py:83: KeyError
From file: rotate3d.yaml
=================== 1 failed, 222 passed in 16.35 seconds ====================

windows testing?

I am trying to set this up on Windows and I get this error below. Has this been run on Windows? It's possible I did not install it correctly.


In [4]: f=AsdfFile()
---------------------------------------------------------------------------
IOError                                   Traceback (most recent call last)
<ipython-input-4-f4d43ec31023> in <module>()
----> 1 f=AsdfFile()

C:\Anaconda\envs\gwcs\lib\site-packages\pyasdf-0.0.dev0-py2.7.egg\pyasdf\asdf.pyc in __init__(self, tree, uri, extensions)
     58         self._blocks = block.BlockManager(self)
     59         if tree is None:
---> 60             self.tree = {}
     61             self._uri = uri
     62         elif isinstance(tree, AsdfFile):

C:\Anaconda\envs\gwcs\lib\site-packages\pyasdf-0.0.dev0-py2.7.egg\pyasdf\asdf.pyc in tree(self, tree)
    181         tagged_tree = yamlutil.custom_tree_to_tagged_tree(
    182             AsdfObject(tree), self)
--> 183         schema.validate(tagged_tree, self)
    184         self._tree = AsdfObject(tree)
    185

C:\Anaconda\envs\gwcs\lib\site-packages\pyasdf-0.0.dev0-py2.7.egg\pyasdf\schema.pyc in validate(instance, ctx, *args, **kwargs)
    281     # test suite!!!).  Instead, we assume that the schemas are valid
    282     # through the running of the unit tests, not at run time.
--> 283     cls = _create_validator()
    284     validator = cls({}, *args, **kwargs)
    285     validator.ctx = ctx

C:\Anaconda\envs\gwcs\lib\site-packages\pyasdf-0.0.dev0-py2.7.egg\pyasdf\schema.pyc in _create_validator()
    156         meta_schema=load_schema(
    157             'http://stsci.edu/schemas/yaml-schema/draft-01',
--> 158             mresolver.default_url_mapping),
    159         validators=YAML_VALIDATORS)
    160     validator.orig_iter_errors = validator.iter_errors

C:\Anaconda\envs\gwcs\lib\site-packages\pyasdf-0.0.dev0-py2.7.egg\pyasdf\schema.pyc in load_schema(url, resolver)
    245         resolver = mresolver.default_url_mapping
    246     loader = _make_schema_loader(resolver)
--> 247     return loader(url)
    248
    249

C:\Anaconda\envs\gwcs\lib\site-packages\pyasdf-0.0.dev0-py2.7.egg\pyasdf\schema.pyc in load_schema(url)
    223     def load_schema(url):
    224         url = resolver(url)
--> 225         return _load_schema(url)
    226     return load_schema
    227

C:\Anaconda\envs\gwcs\lib\site-packages\pyasdf-0.0.dev0-py2.7.egg\pyasdf\compat\functools_backport.pyc in wrapper(*args, **kwds)
    115                         stats[HITS] += 1
    116                         return result
--> 117                 result = user_function(*args, **kwds)
    118                 with lock:
    119                     root, = nonlocal_root

C:\Anaconda\envs\gwcs\lib\site-packages\pyasdf-0.0.dev0-py2.7.egg\pyasdf\schema.pyc in _load_schema(url)
    212 @lru_cache()
    213 def _load_schema(url):
--> 214     with generic_io.get_file(url) as fd:
    215         if isinstance(url, six.text_type) and url.endswith('json'):
    216             result = json.load(fd)

C:\Anaconda\envs\gwcs\lib\site-packages\pyasdf-0.0.dev0-py2.7.egg\pyasdf\generic_io.pyc in get_file(init, mode, uri)
   1014                 realmode = mode + 'b'
   1015             return RealFile(
-> 1016                 open(parsed.path, realmode), mode, close=True,
   1017                 uri=uri or parsed.path)
   1018

IOError: [Errno 2] No such file or directory: u'/stsci.edu/yaml-schema/draft-01.yaml'

Feature request: auto-close files when using context managers

In the following example:

import numpy as np
from pyasdf import AsdfFile

tree = {'test': np.array([1,2,3])}

f = AsdfFile(tree)
f.set_array_storage(tree['test'], 'inline')
f.write_to('data.asdf')

for i in range(1000):
    with AsdfFile.read('data.asdf') as f2:
        np.sum(f2.tree['test'])

I am running into:

OSError: [Errno 24] Too many open files

It would be nice if read could work as a normal context manager and auto-close the file.

overwriting asdf files

This isn't urgent, I am simply reporting it so it's not forgotten.
When attempting to overwrite an asdf file on Windows I get an error:

In [17]: f1.write_to('foc2sky.asdf')
---------------------------------------------------------------------------
IOError                                   Traceback (most recent call last)
<ipython-input-17-1418e770a532> in <module>()
----> 1 f1.write_to('foc2sky.asdf')

C:\Anaconda\envs\gwcs\lib\site-packages\pyasdf-0.0.dev417-py2.7.egg\pyasdf\asdf.pyc in write_to(self, fd, all_array_storage, all_array_compression, auto_inline, pad_blocks)
    586         original_fd = self._fd
    587
--> 588         self._fd = fd = generic_io.get_file(fd, mode='w')
    589
    590         self._pre_write(fd, all_array_storage, all_array_compression,

C:\Anaconda\envs\gwcs\lib\site-packages\pyasdf-0.0.dev417-py2.7.egg\pyasdf\generic_io.pyc in get_file(init, mode, uri)
   1033             realpath = url2pathname(parsed.path)
   1034             return RealFile(
-> 1035                 open(realpath, realmode), mode, close=True,
   1036                 uri=uri)
   1037

IOError: [Errno 22] invalid mode ('wb') or filename: 'foc2sky.asdf'

On Linux the file is silently overwritten. Perhaps add a clobber argument to write_to similar to fits.

Change affine transform

affine.yaml requires a 3x3 matrix for the affine transformation. I think the idea was that this may go to higher dimensions in the future. Pyasdf splits this into matrix and translation in order to initialize modeling.AffineTransformation2D which uses the two quantities separately.

I was assuming pyasdf was writing to disk the augmented matrix but it simply attaches the translation part as an additional row to the matrix. This is confusing.

array([[   0.92913257,   -0.36974676,  100.        ],
       [   0.36974676,    0.92913257,   20.        ],
       [   0.        ,    0.        ,    1.        ]])

vs

matrix: !core/ndarray
    data:
    - [0.92913257, -0.36974676, 0.0]
    - [0.36974676, 0.92913257, 0.0]
    - [100.0, 20.0, 0.0]
    datatype: float64
    shape: [3, 3]

In addition, modeling.AffineTransformation2D is currently the only option that can be used to apply the PC matrix in WCS transformations. However, I am only able to write a 3x3 matrix as an affine transformation. This is also confusing because it may imply the data is 3-dimensional in the WCS context.

Any ideas how to resolve this? This is also related to astropy.modeling issue #3548 which would solve this problem if accepted.
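For reference, the conventional augmented form discussed above keeps the translation in the last column, so a point transforms as [x', y', 1] = M @ [x, y, 1]. A minimal sketch in plain Python, reusing the matrix values from the example above:

```python
# Illustration only: apply a 2-D affine transform in augmented form,
# where the translation lives in the last *column* of the 3x3 matrix.
def apply_affine(m, x, y):
    vec = (x, y, 1.0)
    # Only the first two rows produce coordinates; the third row is [0, 0, 1].
    return tuple(sum(m[i][j] * vec[j] for j in range(3)) for i in range(2))

m = [
    [0.92913257, -0.36974676, 100.0],
    [0.36974676,  0.92913257,  20.0],
    [0.0,         0.0,          1.0],
]
print(apply_affine(m, 0.0, 0.0))  # → (100.0, 20.0): pure translation at the origin
```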

compound transforms lose attributes

This only happens with compound transforms. Attributes like name and inverse are lost.

offx = models.Shift(1)
scl = models.Scale(2)
model = (offx | scl).rename('compound_model')
f = AsdfFile()
f.tree['model'] = model
f.write_to('test.asdf')
f1 = AsdfFile.read('test.asdf')
f1.tree['model'].name
model.name
Out[97]: 'compound_model'

AsdfInFits API enhancements

Two possible enhancements I envision to the API when working with FITS-embedded-ASDF:

  1. AsdfInFits.open currently just accepts an existing HDUList object as its first argument. This means that when reading from a FITS file on disk one has to:
from pyasdf.fits_embed import AsdfInFits
from astropy.io import fits
asdf = AsdfInFits.open(fits.open('filename.fits'))

The two open calls are silly--AsdfInFits.open could easily accept any filename or other object accepted by fits.open.

  2. Relatedly, I think it should be possible to read ASDF directly from a FITS file with pyasdf.open. It's easy to detect that the input is a FITS file (instead of a true ASDF file), and just as easy to detect that it's using the ASDF-embedded-in-FITS convention (which I think should be part of the ASDF Standard if it isn't already, albeit maybe in an appendix since it's really more of a usage convention than part of ASDF itself).

Confusion with asdf / pyasdf package name

This package conflicts with another one:

Python package for astro data ASDF format:

Python package for Seismic data ASDF format:

@krischer @mdboom @embray This is very confusing. Any chance to still change something to avoid the name conflict / simplify this?

JSON Schema validation performance

JSON schema validation currently takes 60% of load time on a benchmark with 10000 arrays.

Unlike the YAML parsing where there was a lot of low-hanging fruit, in JSON schema things are tricky. It's hard to figure out what to do to improve the performance of jsonschema without obliterating its really clean architecture.

Relatedly, I experimented with adding a flag to turn off JSON schema validation. The problem is that then many of the type converters become more brittle in interesting ways because they don't do their own validation that the JSON schema is currently doing for them. Duplicating that work seems like a way to only make things slower, so not sure what to do there.

astropy dependency

Can astropy be an optional dependency? Communities outside astronomy might be interested in using the file format.

Support ``__asdf__`` interface?

While playing around with trying to store Astropy objects in ASDF (e.g. astropy/astropy#3733) I was wondering how we might best make it easy for developers to make their kinds of objects storable in ASDF. One option would be to check if objects have a __asdf__ method that if called will return a valid ASDF structure that can be included in a file? (with all the correct meta-data).
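A minimal sketch of what such a hook might look like (entirely hypothetical; this protocol was a proposal, and the class and field names below are made up):

```python
# Sketch only: the proposed __asdf__ hook would let an object return a
# plain, YAML-serializable structure describing itself.
class SkyCoordLike:
    """Hypothetical stand-in for an Astropy-style object."""
    def __init__(self, ra, dec):
        self.ra, self.dec = ra, dec

    def __asdf__(self):
        return {"ra": self.ra, "dec": self.dec, "unit": "deg"}

def to_tree(obj):
    # A writer could call the hook when present, else use the object as-is.
    return obj.__asdf__() if hasattr(obj, "__asdf__") else obj

print(to_tree(SkyCoordLike(10.0, -30.0)))  # → {'ra': 10.0, 'dec': -30.0, 'unit': 'deg'}
```

(In the end asdf went with external converter classes registered through extensions rather than a method on the stored objects themselves.)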

SystemError when running tests with numpy-1.11 b2

When running with the 1.11 b2 beta release of numpy (Debian unstable), I get a couple of errors like

____________________________ test_streams[tree0] ______________________________

tree = {'not_shared': array([10,  9,  8,  7,  6,  5,  4,  3,  2,  1], dtype=uint8),
     'science_data': array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9.]), 'skipping

    def test_streams(tree):
        buff = io.BytesIO()

        def get_write_fd():
            return generic_io.OutputStream(buff)

        def get_read_fd():
            buff.seek(0)
            return generic_io.InputStream(buff, 'rw')

>       with _roundtrip(tree, get_write_fd, get_read_fd) as ff:

pyasdf/tests/test_generic_io.py:226: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pyasdf/tests/test_generic_io.py:59: in _roundtrip
    ff = asdf.AsdfFile.open(fd, **read_options)
pyasdf/asdf.py:533: in open
    do_not_fill_defaults=do_not_fill_defaults)
pyasdf/asdf.py:475: in _open_impl
    fd, past_magic=True, validate_checksums=validate_checksums)
pyasdf/block.py:243: in read_internal_blocks
    block = self._read_next_internal_block(fd, past_magic=past_magic)
pyasdf/block.py:211: in _read_next_internal_block
    validate_checksum=self._validate_checksums)
pyasdf/block.py:989: in read
    fd, self._size, self._data_size, self.compression)
pyasdf/block.py:1002: in _read_data
    return fd.read_into_array(used_size)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <pyasdf.generic_io.InputStream object at 0x7f4b413b5710>, size = 80

    def read_into_array(self, size):
        try:
            # See if Numpy can handle this as a real file first...
>           return np.fromfile(self._fd, np.uint8, size)
E           SystemError: error return without exception set

pyasdf/generic_io.py:870: SystemError

This happens for test_streams, test_urlopen, test_http_connection, test_exploded_http, test_seek_until_on_block_boundary, test_stream_to_stream, and test_array_to_stream.
With numpy-1.10 (Debian testing) the tests run fine.

I have no idea whether this is an issue for numpy or for pyasdf; I will forward it to numpy as well.

Support non-memmappable types

Examples would be a packed bit format, or anything where the "view" is not exactly the same as the data.

We should try to do this right, if possible, and not repeat the mistakes of pyfits.

Change RTD domain name

Change all references of readthedocs.org to readthedocs.io. This is not urgent but should happen eventually.

Diverged release version numbers on github and pypi

It looks like the package version numbering started to diverge between GitHub and PyPI, with release 1.0.2 here being the one uploaded as 1.1 to PyPI (some discussion on it in #190).

Since then there are two release tags, 1.0.3 and 1.0.4, that seem to be a continuation of the 1.1 version but were not uploaded to PyPI.

Bug when reading a masked array stored inline

The following code:

from numpy import ma
from pyasdf import AsdfFile

tree = {'test': ma.array([1,2,3], mask=[0,1,0])}

f = AsdfFile(tree)
f.set_array_storage(tree['test'], 'inline')
f.write_to('masked.asdf')

f2 = AsdfFile.read('masked.asdf')

triggers the following exception:

Traceback (most recent call last):
  File "buggy.py", line 10, in <module>
    f2 = AsdfFile.read('masked.asdf')
  File "/Users/tom/miniconda3/envs/production/lib/python3.4/site-packages/pyasdf-0.0.dev426-py3.4.egg/pyasdf/asdf.py", line 392, in read
    yaml_content, self, do_not_fill_defaults=do_not_fill_defaults)
  File "/Users/tom/miniconda3/envs/production/lib/python3.4/site-packages/pyasdf-0.0.dev426-py3.4.egg/pyasdf/yamlutil.py", line 269, in load_tree
    tree = tagged_tree_to_custom_tree(tree, ctx)
  File "/Users/tom/miniconda3/envs/production/lib/python3.4/site-packages/pyasdf-0.0.dev426-py3.4.egg/pyasdf/yamlutil.py", line 245, in tagged_tree_to_custom_tree
    return treeutil.walk_and_modify(tree, walker)
  File "/Users/tom/miniconda3/envs/production/lib/python3.4/site-packages/pyasdf-0.0.dev426-py3.4.egg/pyasdf/treeutil.py", line 99, in walk_and_modify
    return recurse(top, set())
  File "/Users/tom/miniconda3/envs/production/lib/python3.4/site-packages/pyasdf-0.0.dev426-py3.4.egg/pyasdf/treeutil.py", line 84, in recurse
    result[key] = recurse(val, new_seen)
  File "/Users/tom/miniconda3/envs/production/lib/python3.4/site-packages/pyasdf-0.0.dev426-py3.4.egg/pyasdf/treeutil.py", line 95, in recurse
    result = callback(result)
  File "/Users/tom/miniconda3/envs/production/lib/python3.4/site-packages/pyasdf-0.0.dev426-py3.4.egg/pyasdf/yamlutil.py", line 242, in walker
    return tag_type.from_tree_tagged(node, ctx)
  File "/Users/tom/miniconda3/envs/production/lib/python3.4/site-packages/pyasdf-0.0.dev426-py3.4.egg/pyasdf/asdftypes.py", line 180, in from_tree_tagged
    return cls.from_tree(tree.data, ctx)
  File "/Users/tom/miniconda3/envs/production/lib/python3.4/site-packages/pyasdf-0.0.dev426-py3.4.egg/pyasdf/stream.py", line 45, in from_tree
    return ndarray.NDArrayType.from_tree(data, ctx)
  File "/Users/tom/miniconda3/envs/production/lib/python3.4/site-packages/pyasdf-0.0.dev426-py3.4.egg/pyasdf/tags/core/ndarray.py", line 343, in from_tree
    return cls(source, shape, dtype, offset, strides, 'C', mask, ctx)
  File "/Users/tom/miniconda3/envs/production/lib/python3.4/site-packages/pyasdf-0.0.dev426-py3.4.egg/pyasdf/tags/core/ndarray.py", line 200, in __init__
    self._array = inline_data_asarray(source, dtype)
  File "/Users/tom/miniconda3/envs/production/lib/python3.4/site-packages/pyasdf-0.0.dev426-py3.4.egg/pyasdf/tags/core/ndarray.py", line 162, in inline_data_asarray
    return np.asarray(inline, dtype=dtype)
  File "/Users/tom/miniconda3/envs/production/lib/python3.4/site-packages/numpy/core/numeric.py", line 462, in asarray
    return array(a, dtype, copy=False, order=order)
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'

HDF5

I suppose you considered using HDF5, but decided not to. I would be interested in knowing the reasons why you chose to design a new format rather than use HDF5. What were the limitations of HDF5 for your use cases? It might also be a good idea to put those reasons in the documentation, as I'm sure other people would be interested as well.

Regular failure of test_http_connection_range

I get consistent failures in test_http_connection_range when I run the tests locally, on both Python 2 and 3. Since this failure doesn't seem to occur on the CI builds, it may be an issue local to my setup, but I'm opening an issue as a reminder to investigate:

tree = {'more': array([[[ 0.16936457,  0.04898563,  0.68901559, ...,  0.97004914,
          0.....46327186,  0.78642262, ...,...123, ...,  0.8856184 ,
         0.13... 0.00274184,  0.78529121, ...,  0.6853034 ,
         0.08646289,  0.77335592]])}
rhttpserver = <pyasdf.conftest.RangeHTTPServer object at 0x428ea50>

    @pytest.mark.skipif(sys.platform.startswith('win'),
                        reason="Windows firewall prevents test")
    def test_http_connection_range(tree, rhttpserver):
        path = os.path.join(rhttpserver.tmpdir, 'test.asdf')
        connection = [None]

        def get_write_fd():
            return generic_io.get_file(open(path, 'wb'), mode='w')

        def get_read_fd():
            fd = generic_io.get_file(rhttpserver.url + "test.asdf")
            assert isinstance(fd, generic_io.HTTPConnection)
            connection[0] = fd
            return fd

        with _roundtrip(tree, get_write_fd, get_read_fd) as ff:
            if len(tree) == 4:
                assert connection[0]._nreads == 0
            else:
>               assert connection[0]._nreads == 6
E               assert 5 == 6
E                +  where 5 = <pyasdf.generic_io.HTTPConnection object at 0x431e910>._nreads

../../../.virtualenvs/13aecf6e-83d7-40c6-86f5-713fad8a4373/lib/python2.7/site-packages/pyasdf/tests/test_generic_io.py:306: AssertionError
------------------------------------------------- Captured stdout call -------------------------------------------------
----------------------------------------
Exception happened during processing of request from ('127.0.0.1', 53619)
----------------------------------------
------------------------------------------------- Captured stderr call -------------------------------------------------
127.0.0.1 - - [08/Jan/2016 15:04:07] "GET /test.asdf HTTP/1.1" 206 -
Traceback (most recent call last):
  File "/internal/1/root/usr/local/lib/python2.7/SocketServer.py", line 295, in _handle_request_noblock
    self.process_request(request, client_address)
  File "/internal/1/root/usr/local/lib/python2.7/SocketServer.py", line 321, in process_request
    self.finish_request(request, client_address)
  File "/internal/1/root/usr/local/lib/python2.7/SocketServer.py", line 334, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/internal/1/root/usr/local/lib/python2.7/SocketServer.py", line 651, in __init__
    self.finish()
  File "/internal/1/root/usr/local/lib/python2.7/SocketServer.py", line 710, in finish
    self.wfile.close()
  File "/internal/1/root/usr/local/lib/python2.7/socket.py", line 279, in close
    self.flush()
  File "/internal/1/root/usr/local/lib/python2.7/socket.py", line 303, in flush
    self._sock.sendall(view[write_offset:write_offset+buffer_size])
error: [Errno 32] Broken pipe
127.0.0.1 - - [08/Jan/2016 15:04:07] "GET /test.asdf HTTP/1.1" 206 -
127.0.0.1 - - [08/Jan/2016 15:04:07] "GET /test.asdf HTTP/1.1" 206 -
127.0.0.1 - - [08/Jan/2016 15:04:07] "GET /test.asdf HTTP/1.1" 206 -
127.0.0.1 - - [08/Jan/2016 15:04:07] "GET /test.asdf HTTP/1.1" 206 -
127.0.0.1 - - [08/Jan/2016 15:04:07] "GET /test.asdf HTTP/1.1" 206 -

add support for custom extensions in `helpers` functions

I'd like to use the pyasdf testing infrastructure with custom extensions, specifically to test roundtripping.
I've looked into adding the `extensions` keyword to the `assert_roundtrip_tree` function (for its calls to `AsdfFile()` and `AsdfFile.open`), but this doesn't seem to be sufficient, as the `type_index` is not updated with the custom types.
Would it be easy to add this functionality?

Context manager clarification

Just out of curiosity, why is the current syntax for the context manager:

ff = AsdfFile(tree)
with ff.write_to("example.asdf"):
    pass

Would it not make sense for it to be:

with AsdfFile(tree) as ff:
    ff.write_to("example.asdf")

What is the purpose of using `write_to` as a context manager?

Multiple ASDF extensions in a single FITS file?

After some limited experimentation with fits_embed, I see that it can write only one ASDF extension, with EXTNAME=ASDF. Is there a plan to allow more than one ASDF extension in a FITS file, with properly managed EXTNAME and EXTVER?

Add support for lzma filter

In addition to the currently supported zlib compression type, it would be useful to support other lossless compression algorithms such as lzma and bz2. The interfaces of these modules in the Python standard library are similar, and lzma has advantages over zlib for some types of data. If these methods are not supported by the standard, it would be useful to allow user-defined filters to implement the compression.
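
The point about similar interfaces is easy to see: the standard-library compression modules all expose the same `compress`/`decompress` pair, which is what makes pluggable filters attractive. This is only an illustration of the stdlib API, not asdf code:

```python
# The stdlib compression modules share a compress/decompress interface,
# so a user-defined filter layer could dispatch to any of them.
import bz2
import lzma
import zlib

payload = bytes(range(256)) * 64  # some repetitive sample data

for mod in (zlib, lzma, bz2):
    packed = mod.compress(payload)
    assert mod.decompress(packed) == payload  # lossless round trip
```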

`write_to` is not thread-safe

Historically, `write_to` changed the underlying file descriptor. As of #118 it no longer does, but it still has to fake doing so because of assumptions elsewhere in the code. This is not a regression: `write_to` was never thread-safe; with the new design it simply no longer has to be. This should all be cleaned up, but at a fairly low priority.

Python 3.5 ImportError for pyasdf.compat.user_collections_py3

I'm getting this error for Python 3.5:

$ pip install asdf
$ python -c 'import pyasdf; pyasdf.test()'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Users/deil/Library/Python/3.5/lib/python/site-packages/pyasdf/__init__.py", line 37, in <module>
    from .asdf import AsdfFile
  File "/Users/deil/Library/Python/3.5/lib/python/site-packages/pyasdf/asdf.py", line 15, in <module>
    from . import block
  File "/Users/deil/Library/Python/3.5/lib/python/site-packages/pyasdf/block.py", line 23, in <module>
    from .compat.numpycompat import NUMPY_LT_1_7
  File "/Users/deil/Library/Python/3.5/lib/python/site-packages/pyasdf/compat/__init__.py", line 13, in <module>
    from .user_collections_py3.UserDict import UserDict
ImportError: No module named 'pyasdf.compat.user_collections_py3'

transform.name does not roundtrip

This demonstrates the problem:

rot = models.Rotation2D(23, name='rotation')
fa = AsdfFile()
fa.tree = {'r': rot}
fa.write_to('rot.asdf')
frot = AsdfFile.read('rot.asdf')
frot.tree['r'].name is None
Out[38]: True

`name` is already part of the basic transform schema. Is there a way to handle this in the general transform `to_tree` method instead of adding it to every subclass of `TransformType`?
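
One way the idea could look, sketched without any pyasdf internals: handle `name` once in the shared base class and let subclasses serialize only their own fields. The helper names `to_tree_transform`/`from_tree_transform`, and the `Rotation` stand-in, are invented here for illustration; this is not the real implementation:

```python
# Hypothetical sketch: ``name`` is (de)serialized once in the base class.
class Rotation:
    def __init__(self, angle, name=None):
        self.angle = angle
        self.name = name

class TransformType:
    @classmethod
    def to_tree(cls, model):
        node = cls.to_tree_transform(model)  # subclass-specific fields
        if model.name is not None:
            node["name"] = model.name        # handled once, generically
        return node

    @classmethod
    def from_tree(cls, node):
        model = cls.from_tree_transform(node)
        model.name = node.get("name")        # restored on the way back
        return model

class RotationType(TransformType):
    @classmethod
    def to_tree_transform(cls, model):
        return {"angle": model.angle}

    @classmethod
    def from_tree_transform(cls, node):
        return Rotation(node["angle"])

rot = Rotation(23, name="rotation")
assert RotationType.from_tree(RotationType.to_tree(rot)).name == "rotation"
```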

Scope of ASDF beyond Astronomy?

At the moment, the tagline for the repo is "ASDF (Advanced Scientific Data Format) is a next generation interchange format for astronomical data". I wonder if it would be worth making it sound like it would also be useful to other fields, e.g. "being developed for astronomical and other scientific data"?

does asdf support virtual datasets?

I can't tell from reading the specification what transformations do, so I am asking here. Can I specify a dataset `a` that draws data from dataset `b`, but applies some transformation to the data before returning it? For example, b = [1, 2, 3] and a = b + 4, so when I read values from `a` I would get [5, 6, 7].

The whole discussion of transformations seems tailored to the astronomy community, and is hard to follow if you don't know what WCS is.
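
For what it's worth, the concept being asked about can be sketched in a few lines of plain Python: a "virtual" dataset that holds a reference to a source dataset plus a transformation, and evaluates on read. This is only an illustration of the question's example, not a statement about what ASDF supports:

```python
# Sketch of a derived/"virtual" dataset: values are computed from the
# source at read time, not stored.
class VirtualDataset:
    def __init__(self, source, transform):
        self.source = source        # underlying dataset
        self.transform = transform  # applied lazily on access

    def read(self):
        return [self.transform(x) for x in self.source]

b = [1, 2, 3]
a = VirtualDataset(b, lambda x: x + 4)
assert a.read() == [5, 6, 7]  # a = b + 4, derived on read
```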

Sphinx warnings with Sphinx 1.3.5

The Sphinx build in master currently fails with warnings like this:

WARNING: Could not parse literal_block as "yaml". highlighting skipped.

These are issued from code like

.. runcode::

   from asdf import AsdfFile

   # Make the tree structure, and create an AsdfFile from it.
   tree = {'hello': 'world'}
   ff = AsdfFile(tree)
   ff.write_to("test.asdf")

   # You can also make the AsdfFile first, and modify its tree directly:
   ff = AsdfFile()
   ff.tree['hello'] = 'world'
   ff.write_to("test.asdf")

.. asdf:: test.asdf

This is discussed a bit in #190 in this comment.

general questions

@mdboom Two questions that came up while testing this with some real data.

  • I had a file written with a previous version of pyasdf and I can't read it with the latest version. If I write the file again with the current version, I can read it back correctly. I understand this is still in development, but I'm wondering if there's a way to handle this in the future. I am not suggesting "once ASDF, always ASDF", but I suspect we'll have to handle versioning in some way, so I'm raising it for consideration. One thing that could be done now is to write out the version of pyasdf that created the file.
  • The second question is about performance. I have 3 files: "dist.asdf" has a compound model consisting of a few polynomials, "foc2sky.asdf" is the typical WCS transformation consisting of linear transformation, tan deprojection and sky rotation, and "image.asdf" is the above two combined into one file. Here's the timing I get from reading them:
timeit f=AsdfFile.read('dist.asdf')
1 loops, best of 3: 422 ms per loop

timeit f=AsdfFile.read('foc2sky.asdf')
10 loops, best of 3: 54.9 ms per loop

timeit f=AsdfFile.read('image.asdf')
1 loops, best of 3: 8.99 s per loop

Why the big difference?

Stream read the YAML tree

PyYAML doesn't like it when you give it a YAML file with "invalid" (non-UTF8) content following the end marker. Why it tries to read past the end marker at all is sort of beyond me.

The current solution is to read the entire tree into a string and pass that to PyYAML. It should also be possible to use some sort of reading proxy that treats `...` as EOF. That would be more memory efficient when the tree is really large, though it might add some overhead. Worth experimenting with in any event.
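
The reading-proxy idea can be sketched in a simplified, line-based form: stop feeding bytes to the parser once the YAML document end marker is seen, so any binary blocks after it are never read. This is an illustration of the approach, not the asdf implementation:

```python
# Sketch: collect the raw YAML tree up to and including the '...'
# end-of-document marker, leaving trailing binary content untouched.
import io

def read_yaml_tree(fd):
    buf = io.BytesIO()
    for line in fd:
        buf.write(line)
        if line.rstrip(b"\r\n") == b"...":
            break  # stop before any binary blocks
    return buf.getvalue()

raw = io.BytesIO(b"hello: world\n...\n\x00\xffBINARY-BLOCK")
assert read_yaml_tree(raw) == b"hello: world\n...\n"
```

A true proxy would expose a file-like `read()` that returns EOF at the marker, but the truncation logic is the same.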

Make the block index and streaming blocks play well together

There was discussion about this in #144.

Including a block index makes it impossible to add to the streaming block after the file is written. I think this is an important use case. However, maybe it also makes sense to allow "freezing" an ASDF file and appending a block index, after which the streaming block could not be updated.

As it stands, the block index and streaming blocks are explicitly disallowed to coexist.

problem using `with` to read a file

@mdboom Should I not use `with` to open ASDF files? Or am I doing something wrong?

ar = np.arange(36).reshape((6,6))
f = AsdfFile()
f.tree = {'regions': ar, 'reftype': 'regions'}
f.write_to('test.asdf')
with AsdfFile.open('test.asdf') as f:
    reg=f.tree['regions']
print reg
<array (unloaded) shape: [6, 6] dtype: int64>
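
This looks like lazy loading rather than a bug: the array in the tree is only a reference into the file, and the `with` block closes the file on exit, leaving the array in its "unloaded" state. A minimal asdf-free sketch of the pattern (all names here are illustrative):

```python
# Sketch of lazy loading: data is fetched through a loader that becomes
# invalid once the backing "file" is closed.
class LazyArray:
    def __init__(self, loader):
        self._loader = loader
        self._closed = False

    def close(self):
        self._closed = True

    def load(self):
        if self._closed:
            raise IOError("underlying file is closed")
        return self._loader()

arr = LazyArray(lambda: list(range(36)))
data = arr.load()  # materialize while the "file" is still open
arr.close()
assert data[:3] == [0, 1, 2]  # the copy survives the close
```

With asdf itself, a common workaround is to copy the data out (e.g. into a plain array) inside the `with` block, while the file is still open.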
