ncas-cms / cfdm

A Python reference implementation of the CF data model

Home Page: http://ncas-cms.github.io/cfdm

License: MIT License

Python 99.52% Shell 0.31% TeX 0.16%
cf metadata python netcdf climate forecasting ocean atmosphere atmospheric-science

cfdm's People

Contributors: bewithankit, davidhassell, ncascms, sadielbartholomew

cfdm's Issues

`ordered` errors for all but cell method constructs

On the latest master the ordered method on a Constructs class appears to return successfully only for cell methods, raising a ValueError for a Constructs object containing any other valid and same-type constructs, e.g.:

>>> import cfdm
>>> f = cfdm.example_field(6)
>>> c = f.constructs()
>>> c
<Constructs: auxiliary_coordinate(4), coordinate_reference(1), dimension_coordinate(1), domain_axis(2)>
>>> a = c.filter_by_type('auxiliary_coordinate')
>>> a
<Constructs: auxiliary_coordinate(4)>
>>> a.ordered()
{'cell_method'} {'auxiliary_coordinate'}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/sadie/cfdm/cfdm/core/constructs.py", line 1240, in ordered
    raise ValueError(
ValueError: Can't order un-orderable construct type: <Constructs: auxiliary_coordinate(4)>
>>> b = c.filter_by_type('domain_axis')
>>> b.ordered()
{'cell_method'} {'domain_axis'}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/sadie/cfdm/cfdm/core/constructs.py", line 1240, in ordered
    raise ValueError(
ValueError: Can't order un-orderable construct type: <Constructs: domain_axis(2)>

where the output lines before the tracebacks appear because I added the following print call into the method for debugging:

diff --git a/cfdm/core/constructs.py b/cfdm/core/constructs.py
index 575ffd342..f09490f8d 100644
--- a/cfdm/core/constructs.py
+++ b/cfdm/core/constructs.py
@@ -1237,6 +1235,7 @@ class Constructs(abstract.Container):
                 "Can't order multiple construct types: {!r}".format(self)
             )
 
+        print(self._ordered_constructs, set(self._constructs))
         if self._ordered_constructs != set(self._constructs):
             raise ValueError(
                 "Can't order un-orderable construct type: {!r}".format(self)

and the first printed item is always {'cell_method'}, demonstrating that the method seems able to order only cell method constructs. That doesn't seem right, especially as the docstring implies this is a generic construct method.

It appears that, once same-type constructs are input, the final logic only handles cell method constructs, because only they get added to the _ordered_constructs instance attribute, in the line:

self._ordered_constructs.add("cell_method")

Note that this behaviour gets passed downstream and also manifests in cf-python (the initial code snippet above has the same results when cf is substituted for cfdm).

@davidhassell it would be useful to hear your thoughts: should other types of constructs be processed, and if so, is it the case, as I suspect and as implied by this line in the docstring:

cfdm/cfdm/core/constructs.py

Lines 1205 to 1206 in e908dc8

For cell method constructs, the predetermined order is that in
which they were added.

that cell methods are being treated as a special case, and that logic to handle the ordering of all other construct types is missing? Thanks.

Logging enhancements

Follow-on from #31. Now that the infrastructure for logging is in place, we can make good use of it with some extensions [feel free to add to this listing, anyone!]:

  • #31 did not add new logging calls; it simply replaced the print calls existing at that point, which only appeared in equals methods & read or write functions. We should:

    • add meaningful messages across the codebase at applicable levels;
    • add a verbose kwarg to any function that ends up with a significant amount of log calls.
  • Improve the display of objects in log calls for readability. For example, use pprint.pformat to print dictionary or list structures with one item per line, making it easy to pick out a particular item, especially where the structure has many items and it would otherwise be difficult. I have already changed a few messages in read_write.netcdf.netcdfread accordingly, e.g.:

    logger.detail(
        " Global attributes:\n" +
        pformat(g['global_attributes'], indent=4)
    )  # pragma: no cover

  • Currently log messages go, as the equivalent print() statements did, to STDOUT as plain messages, i.e. no extra metadata such as datetime stamps is included, just the log level, logger name, & the message itself. However, I think it would be beneficial to have at least one new handler that provides (in addition to all logging messages) (date)time stamps, the calls made by the user, and exceptions which get raised outside of the logging system.

    I think a file handler is best, such that the user can specify a path where a dedicated, named log file gets written out (& rolled over if there is the potential for it to grow large) with every possible detail (i.e. set cfdm.LOG_LEVEL('DEBUG') for that handler). This would be great for user support & debugging purposes; a rough sketch of such a handler follows this list.

    • for this handler, add functionality to print timestamped calls that are run by the user (not just the log messages themselves).
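As a rough illustration of the file handler idea (a sketch only; the file name and rollover policy are assumptions, and this is not part of cfdm):

import logging
from logging.handlers import RotatingFileHandler

# Dedicated, rolled-over log file capturing every message with timestamps,
# logger names and levels, attached to the top-level cfdm logger.
handler = RotatingFileHandler(
    'cfdm.log', maxBytes=10 * 1024 * 1024, backupCount=3
)
handler.setLevel(logging.DEBUG)
handler.setFormatter(
    logging.Formatter('%(asctime)s %(name)s %(levelname)s: %(message)s')
)
logging.getLogger('cfdm').addHandler(handler)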

Docs: highlight full method for '[source]' link

Instead of linking just to the first line of the relevant method from a '[source]' link in the API reference of the documentation, it would be better to link to the full method. In other words, the link would go to a page for the relevant module in the codebase and have highlighted multiple lines covering the extent of the method, rather than just the one with the def declaring it.

Thanks to @sadielbartholomew for fixing this in cf-python - the solution will be the same here.

PEP8 compliance: review (& justify chosen) exclusions

For cfdm, as for cf-python in the equivalent issue NCAS-CMS/cf-python#83: the cfdm codebase is now PEP8-compliant under the interpretation & scope of the pycodestyle library, with the exception of several rules I have explicitly excluded (where there is, sadly, no easy way to exclude on a per-case/line basis instead). We should review these exclusions, which in this case are:

# These are pycodestyle errors and warnings to explicitly ignore. For
# descriptions for each code see:
# https://pep8.readthedocs.io/en/latest/intro.html#error-codes
pep8_check.options.ignore += (  # ignored because...
    'W605',  # ...false positives on regex and LaTeX expressions
    'E272',  # ...>1 spaces to align keywords in long import listings
    'E402',  # ...justified lower module imports in {.., core}/__init__
    'E501',  # ...docstring examples include output lines >79 chars
    'E722',  # ...lots of "bare except" cases need to be addressed
)

& decide whether pycodestyle is the right tool, among many options, for our requirements on linting.

Failure when writing field with converted datatype

When converting the data type of an output array with the datatype keyword of cfdm.write, the _FillValue is not converted, leading to a netCDF error:

>>> import numpy
>>> import cfdm
>>> f = cfdm.example_field(1)
>>> f.set_property('_FillValue', 45.)
>>> cfdm.write(f, 'delme.nc', datatype={numpy.dtype('float64'): numpy.dtype('float32')})
<snip>
AttributeError: NetCDF: Not a valid data type or _FillValue type mismatch

The solution is simply to make sure that _FillValue and missing_value attributes are converted if required.
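For illustration, a sketch of that fix as a stand-alone helper (the function name is hypothetical; the real change would live inside the write code):

import numpy
import cfdm

def convert_fill_values(field, datatype):
    # Convert _FillValue and missing_value to the output data type implied
    # by the cfdm.write datatype mapping, so they match the written data.
    new_dtype = datatype.get(field.data.dtype)
    if new_dtype is None:
        return
    for prop in ('_FillValue', 'missing_value'):
        value = field.get_property(prop, None)
        if value is not None:
            field.set_property(prop, numpy.array(value, dtype=new_dtype))

f = cfdm.example_field(1)
f.set_property('_FillValue', 45.)
convert_fill_values(f, {numpy.dtype('float64'): numpy.dtype('float32')})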

Verify support for Python 3.9 (once it is more established)

As per the description in NCAS-CMS/cfunits#15, though in this case I have not yet tried to add 3.9 jobs to the workflow to run the test suite (if setting up the environment for cfunits fails at the moment, it certainly will do the same for cfdm).

In short: in late 2020 or early 2021, once 3.9 is more established and probably supported by at least some dependencies, we should check whether our dependencies allow us to support 3.9, and document (and perhaps package accordingly) whether or not 3.9 is supported.

Bug when setting datum and coordinate conversion parameters

When setting datum and coordinate conversion parameters via the coordinate reference construct attributes, the new settings do not appear in the parent coordinate reference construct:

In [1]: import cfdm                                                                                          

In [2]: cr = cfdm.CoordinateReference()                                                                      

In [3]: cr.datum.set_parameter('test', 123)                                                                  

In [4]: cr.dump()                                                                                            
Coordinate Reference: 

Decorator to administer display keyword argument

This is very low on the prioritisation list, but I've noticed there are several cases of logic to either return or print a constructed string depending on the value of a Boolean parameter display, essentially (in the most common case) the following:

def <method name>(<args, other kwargs>, display=True):
    <...
    construct 'string' var
    ....>

    string = '\n'.join(string)

    if display:
        print(string)
    else:
        return string

and this would be a great candidate for logic to apply via a decorator. It should be fairly straightforward to implement such a decorator, too.
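For example, a minimal sketch of such a decorator (the name is illustrative, not the eventual cfdm API) could be:

import functools

def _display_or_return(method):
    # The wrapped method just builds and returns the string; the decorator
    # prints it or returns it according to the display keyword.
    @functools.wraps(method)
    def wrapper(self, *args, display=True, **kwargs):
        string = method(self, *args, **kwargs)
        if display:
            print(string)
            return None
        return string
    return wrapper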

Custom exceptions

cfdm does not use any custom exceptions, other than DeprecationError in mixin.netcdf. We could likely make some classes a bit cleaner & improve user feedback on errors etc. by creating & applying some custom exception classes to which to delegate some error handling.

(Copied, minus noise & with a little tidying, from #64 (comment))

Ideas

I can flesh this out further, e.g. with a potential inheritance structure, as we think about it & work out what might be useful, but firstly here are some potential candidates for useful exceptions (a rough sketch of a possible hierarchy follows the list):

  • cfdm implementation errors (relating to attempts to subclass);
  • CF compliance warnings (as specific forms of Python warnings lib warnings that the user can opt to enable or disable by interfacing with the logging framework);
  • construct i.e. metadata errors:
    • missing i.e. not defined when expected/required;
    • mismatched dimensions or shape;
    • invalid;
    • etc.
  • data errors, e.g. perhaps:
    • array-related;
    • compression-related;
  • field construct operation errors:
    • on reading fields;
    • on writing fields;
    • on aggregation;
  • errors related to a specific encoding, notably netCDF.
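As a very rough sketch of what a hierarchy along those lines might look like (the class names are purely illustrative, not a proposal for the final API):

class CFDMError(Exception):
    '''Base class for cfdm-specific errors.'''

class ImplementationError(CFDMError):
    '''Raised when a subclass does not honour the cfdm implementation API.'''

class ConstructError(CFDMError):
    '''Raised for missing, mismatched or otherwise invalid constructs.'''

class DataError(CFDMError):
    '''Raised for array- or compression-related data problems.'''

class FieldOperationError(CFDMError):
    '''Raised when reading, writing or aggregating field constructs fails.'''

class NetCDFEncodingError(CFDMError):
    '''Raised for errors tied to a specific encoding, notably netCDF.'''

class CFComplianceWarning(UserWarning):
    '''A warning (not an error) for CF compliance issues, which the user
    could enable or disable via the logging/warnings framework.'''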

Issue a warning if reading and writing datasets that have valid_[min|max|range] attributes

The presence of the valid_min, valid_max or valid_range attributes causes data to be masked for its "out-of-range" values. This is sometimes surprising, particularly if the data has been modified between the read and write operations.

Proposal:

Add a warn_valid keyword parameter, default True, to cfdm.read and cfdm.write that warns when such properties are present. The cfdm.write case will only warn if the data contains out-of-range values. Since data is not actually read by cfdm.read (lazy loading), it is not possible to check for out-of-range data during the read, so there the mere presence of any of these properties will trigger the warning.

If warn_valid=False then warnings are suppressed.
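A sketch of the read-side presence check (the helper name is hypothetical; the real logic would sit inside cfdm.read):

import warnings

_VALID_PROPERTIES = ('valid_min', 'valid_max', 'valid_range')

def _warn_if_valid_properties(field, warn_valid=True):
    # On read, data is loaded lazily, so only the presence of the
    # properties can be checked; their presence alone triggers the warning.
    if not warn_valid:
        return
    found = [p for p in _VALID_PROPERTIES if field.has_property(p)]
    if found:
        warnings.warn(
            "Field {!r} has {} properties: out-of-range data will be "
            "masked".format(field, ', '.join(found))
        )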

Bug: Failure when writing a dataset that contains a scalar domain ancillary construct

>>> import cfdm
>>> cfdm.environment(paths=False)
Platform: Linux-5.4.0-53-generic-x86_64-with-debian-bullseye-sid
HDF5 library: 1.10.5
netcdf library: 4.6.3
python: 3.7.0
netCDF4: 1.5.4
cftime: 1.3.0
numpy: 1.18.4
netcdf_flattener: 1.2.0
cfdm: 1.8.7.0
>>> import cfdm; f = cfdm.read('~/thetao_Omon_MIROC6_abrupt-4xCO2_r1i1p1f1_gn_344001-344912.cdl')[0]
>>> cfdm.write(f, 'delme.nc')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-2-e145e8c98469> in <module>()
----> 1 cfdm.write(f, 'delme.nc')

~/cfdm/cfdm/read_write/write.py in write(fields, filename, fmt, overwrite, global_attributes, variable_attributes, file_descriptors, external, Conventions, datatype, least_significant_digit, endian, compress, fletcher32, shuffle, string, verbose, warn_valid, group, coordinates, _implementation)
    460                      string=string, verbose=verbose,
    461                      warn_valid=warn_valid, group=group,
--> 462                      coordinates=coordinates, extra_write_vars=None)

~/cfdm/cfdm/decorators.py in verbose_override_wrapper(self, *args, **kwargs)
    184         # enabling
    185         try:
--> 186             return method_with_verbose_kwarg(self, *args, **kwargs)
    187         except Exception:
    188             raise

~/cfdm/cfdm/read_write/netcdf/netcdfwrite.py in write(self, fields, filename, fmt, overwrite, global_attributes, variable_attributes, file_descriptors, external, Conventions, datatype, least_significant_digit, endian, compress, fletcher32, shuffle, scalar, string, extra_write_vars, verbose, warn_valid, group, coordinates)
   4466         # ------------------------------------------------------------
   4467         for f in fields:
-> 4468             self._write_field(f)
   4469 
   4470         # ------------------------------------------------------------

~/cfdm/cfdm/read_write/netcdf/netcdfwrite.py in _write_field(self, f, add_to_seen, allow_data_insert_dimension)
   3390         for key, anc in sorted(
   3391                 self.implementation.get_domain_ancillaries(f).items()):
-> 3392             self._write_domain_ancillary(f, key, anc)
   3393 
   3394         # ------------------------------------------------------------

~/cfdm/cfdm/read_write/netcdf/netcdfwrite.py in _write_domain_ancillary(self, f, key, anc)
   2192 
   2193             # Create a new domain ancillary variable
-> 2194             self._write_netcdf_variable(ncvar, ncdimensions, anc)
   2195 
   2196         g['key_to_ncvar'][key] = ncvar

~/cfdm/cfdm/read_write/netcdf/netcdfwrite.py in _write_netcdf_variable(self, ncvar, ncdimensions, cfvar, omit, extra, fill, data_variable)
   2533         if g['group']:
   2534             groups = self._groups(ncvar)
-> 2535             for ncdim in ncdimensions:
   2536                 ncdim_groups = self._groups(ncdim)
   2537                 if not groups.startswith(ncdim_groups):

TypeError: 'NoneType' object is not iterable

Request: Allow global constants to be controlled by a context manager

This is so we can stop doing things like:

>>> old = cfdm.log_level('DEBUG')
>>> <execute some code>
>>> cfdm.log_level(old)

and start doing things like:

>>> with cfdm.log_level('DEBUG'):
...     <execute some code>
...
>>>

Implementation

The getter/setter function for each constant will have to return a Constant object that defines the __enter__ and __exit__ methods, rather than the constant itself. The existing functions will be happy taking as input either their current value type (e.g. str) or the new Constant instance, i.e. all existing code will still work the same.

PR to follow.
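To make the idea concrete, here is a minimal, self-contained sketch of the pattern (not the actual cfdm implementation; cfdm's Constant class and log_level internals will differ):

class Constant:
    # Returned by the getter/setter, holding the *previous* value so that
    # exiting a with-block restores it via the same getter/setter.
    def __init__(self, old_value, _func):
        self.value = old_value
        self._func = _func

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self._func(self.value)


_state = {'LOG_LEVEL': 'WARNING'}

def log_level(new=None):
    # Accepts either a plain value or a Constant, so existing code that
    # does old = log_level('DEBUG'); ...; log_level(old) keeps working.
    old = Constant(_state['LOG_LEVEL'], log_level)
    if new is not None:
        _state['LOG_LEVEL'] = new.value if isinstance(new, Constant) else new
    return old


with log_level('DEBUG'):
    assert _state['LOG_LEVEL'] == 'DEBUG'
assert _state['LOG_LEVEL'] == 'WARNING'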

Revise NetCDFRead messaging & _code{0, 1} class vars

The pair of strings input to _add_message throughout NetCDFRead must exist as keys respectively in the _code0 and _code1 class variable dicts:

_code0 = {
    # Physically meaningful and corresponding to constructs
    'Cell measures variable': 100,
    'cell_measures attribute': 101,
    'Bounds variable': 200,
    'bounds attribute': 201,
    'Ancillary variable': 120,
    'ancillary_variables attribute': 121,
    'Formula terms variable': 130,
    'formula_terms attribute': 131,
    'Bounds formula terms variable': 132,
    'Bounds formula_terms attribute': 133,
    'Auxiliary/scalar coordinate variable': 140,
    'coordinates attribute': 141,
    'grid mapping variable': 150,
    'grid_mapping attribute': 151,
    'Grid mapping coordinate variable': 152,
    'Cell method interval': 160,
    'External variable': 170,
    # Purely structural
    'Compressed dimension': 300,
    'compress attribute': 301,
    'Instance dimension': 310,
    'instance_dimension attribute': 311,
    'Count dimension': 320,
    'count_dimension attribute': 321,
}

_code1 = {
    'is incorrectly formatted': 2,
    'is not in file': 3,
    'spans incorrect dimensions': 4,
    'is not in file nor referenced by the external_variables global attribute': 5,
    'has incompatible terms': 6,
    'that spans the vertical dimension has no bounds': 7,
    'that does not span the vertical dimension is inconsistent with the formula_terms of the parametric coordinate variable': 8,
    'is not referenced in file': 9,
    'exists in the file': 10,
    'does not exist in file': 11,
    'exists in multiple external files': 12,
    'has incorrect size': 13,
    'is missing': 14,
    'is not used by data variable': 15,
    'not in node_coordinates': 16,
}

otherwise a KeyError will be thrown, obscuring the true message we want to provide to the end-user via that method:

if message is not None:
    code = self._code0[message[0]]*1000 + self._code1[message[1]]

On more than one occasion now the strings were not present as keys in those dictionaries when they should have been, & this has led to bugs when reading netCDF, e.g. as fixed in 201ba62 (for which I checked that every message component is present as it should be, & will do a double check shortly, but we should consider the scenario where new messages will likely be added during development).

Moreover there are some secondary issues:

  • the codes as values in _code{0,1} are currently arbitrary (I believe), awaiting decisions on potential standardisation as error/warning codes under the CF Conventions;
  • some of the messages are arguably saying the same thing e.g. 'is not in file': 3 & 'does not exist in file': 11 are perhaps interchangeable.

As discussed recently, for the above reasons we might want to re-consider how to implement the messaging. I think we want to preserve the two-component standardisation encapsulated by the combination of _code0 and _code1 keys, as we have at present, but in a way that also (one possible direction is sketched after this list):

  • makes it impossible or at least less likely that there can be bugs, e.g. by not requiring duplication of messages as keys in those dicts which must be cross-referenced to keep consistent.
  • allows us to test the messaging directly within the test suite, without having to have cases of netCDF files with each corresponding issue in order to hit the logic with every call to _add_message.
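One possible direction, sketched only to illustrate the idea (the member names and the selection of codes are arbitrary, not a proposal for the final values): keep each message component and its code in a single enum, so a missing entry fails loudly at the call site rather than as an obscure KeyError deep inside _add_message, and the registry can be exercised directly by the test suite.

import enum

class Code0(enum.IntEnum):
    # Each message component is paired with its code exactly once.
    CELL_MEASURES_VARIABLE = 100
    BOUNDS_VARIABLE = 200
    EXTERNAL_VARIABLE = 170

class Code1(enum.IntEnum):
    IS_INCORRECTLY_FORMATTED = 2
    IS_NOT_IN_FILE = 3
    IS_MISSING = 14

def message_code(part0, part1):
    # Mirrors the existing code = _code0[...]*1000 + _code1[...] scheme,
    # but with no string keys to keep cross-referenced.
    return int(part0) * 1000 + int(part1)

assert message_code(Code0.BOUNDS_VARIABLE, Code1.IS_NOT_IN_FILE) == 200003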

Container copy method implies deep copy behaviour

Initial work towards doctesting has implied, and after further investigation I can confirm, that the copy method for the ABC Container (i.e. cfdm.core.Container.copy), which is documented as being a deep copy operation, in fact only displays the behaviour of a shallow copy.

For example, note how a change to an item nested inside the _custom dict of g is also reflected in f, so the item appears to be a reference rather than a copy, whereas rebinding a top-level value in g is not reflected in f:

>>> # Setup
>>> import cfdm
>>> f = cfdm.core.abstract.container.Container()
>>> f._custom
{}
>>> f._custom['feature'] = ['f']
>>> f._custom
{'feature': ['f']}

# Apply the copy, expecting it to be deep
>>> g = f.copy()
>>> g._custom['feature'][0] = 'g'
>>> g._custom
{'feature': ['g']}

# ...but note how the change is also reflected in f:
>>> f._custom
{'feature': ['g']}

# ...though changing the top-level value for g does not influence f:
>>> g._custom['feature'] = 'gee whiz'
>>> g._custom
{'feature': 'gee whiz'}
>>> f._custom
{'feature': ['g']}
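For reference, a minimal stand-in (not the cfdm class itself) showing the deep-copy semantics that the documentation implies:

import copy

class SimpleContainer:
    # A toy container whose copy() deep-copies the custom dictionary, so
    # mutable values (such as lists) are not shared with the original.
    def __init__(self):
        self._custom = {}

    def copy(self):
        new = type(self)()
        new._custom = copy.deepcopy(self._custom)
        return new

f = SimpleContainer()
f._custom['feature'] = ['f']
g = f.copy()
g._custom['feature'][0] = 'g'
assert f._custom == {'feature': ['f']}   # the original is unaffected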

Environment

>>> cfdm.environment(paths=False)
Platform: Linux-4.15.0-54-generic-x86_64-with-glibc2.10
HDF5 library: 1.10.6
netcdf library: 4.7.4
Python: 3.8.5
netCDF4: 1.5.4
numpy: 1.19.4
cfdm.core: 1.8.8.1
cftime: 1.3.0
netcdf_flattener: 1.2.0
cfdm: 1.8.8.1

Add logging to manage & configure verbose flag logic

Equivalent to NCAS-CMS/cf-python#37. We have decided to add logging here (to cfdm) first: cf-python builds on top of it, so as a base it is a more natural starting point, & there is less code to cover to fully implement the logging as a proof of concept for possible adjustment.

We have also decided the logging should be promoted as a user feature for configurable feedback, not just as a developer aid.

We will go with Python's standard logging module as it is excellent & certainly sufficient for the requirements here.

Docs: immobile navigation menu could be cut off

The navigation menu (left pane) of the documentation seems to have gotten a little longer with some new sections, & with the screen size of my (work) laptop has become just long enough to extend off the screen vertically:

[screenshot: cfdm-menu-long-scr, the navigation menu extending off the bottom of the screen]

Nothing important is cut off at the moment with my screen view as an example, but whilst I know that is close to the end of the menu, users may not, & may be frustrated at not being able to access the later parts of the menu without difficulty (zooming out etc.). It is not possible to scroll as there is no scrollbar (which is a style choice I agree with), & even forcing it by highlighting the text, which works with the readthedocs theme, doesn't work in this theme.

To alleviate the potential for such problems, I suggest moving some of the top-level headings under a more general page. 'Subclassing', 'Philosophy', 'Performance' & 'Versioning' are good candidates as they are all small pages & aren't especially standard docs sections. Maybe the first three could live under an umbrella 'Implementation' heading placed after the 'CF Data Model' section in the listing?

Overall there are various approaches that would solve this, but I personally would rather not add a scrollbar to the nav menu, or change to a nested tree type of menu, as the menu looks very clean & clear as it is (just a little too long).

Setting a vertical coordinate reference system

I've just started using cfdm ... still finding my way around the many classes. I'm trying to implement a vertical coordinate reference system for an atmosphere_hybrid_height_coordinate .. using the "more complete" example in the tutorial (https://ncas-cms.github.io/cfdm/1.7.1/tutorial.html ) .. which shows all the structures I need. In my code I get an error message (copied below) about an unexpected argument. I can reproduce the error if I take your script from the tutorial (which works as it is) and comment out the line "tas.set_construct(horizontal_crs)". Then, as in the script I want to create, you only have a vertical coordinate reference. The script still executes fine. tas.dump() also works as expected, but cfdm.write( "tas.nc", tas ) produces the following:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.5/dist-packages/cfdm/read_write/write.py", line 357, in write
    verbose=verbose)
  File "/usr/local/lib/python3.5/dist-packages/cfdm/read_write/netcdf/netcdfwrite.py", line 3402, in write
    self._write_field(f)
  File "/usr/local/lib/python3.5/dist-packages/cfdm/read_write/netcdf/netcdfwrite.py", line 2699, in _write_field
    self._create_vertical_datum(ref, owning_coord_key)
  File "/usr/local/lib/python3.5/dist-packages/cfdm/read_write/netcdf/netcdfwrite.py", line 2853, in _create_vertical_datum
    datum=self.implementation.get_datum(ref))
TypeError: initialise_CoordinateReference() got an unexpected keyword argument 'coordinates'

Implement the new Domain construct for CF-1.9

The new CF domain variable (cf-convention/cf-conventions#301) requires a CF data model domain construct. In the existing data model, the domain is represented by an abstract Domain concept, but the new CF-netCDF domain variable "promotes" the domain to construct status, on a similar footing to the Field class.

Implementation notes:

  • Domain variables can exist independently of data variables, and so cfdm needs to be able to read them from netCDF files.
  • Climatological time bounds in a Domain do not have access to any cell methods.
  • The new Domain construct will need much of the functionality that currently only exists for a Field construct, so refactoring is required to maximise code reuse.

Mapping of metadata-only CDL to field constructs

NCAS-CMS/cf-python#197 has highlighted that cfdm as-is is not able to interpret as definite field(s) some cases of CDL inputs which provide only schema and/or coordinate information, e.g. as produced by ncdump -h or ncdump -c. We realised it may not in fact be possible, given the nature of the CF data model, to unambiguously map CDL with metadata but no data arrays onto fields, though we aren't certain.

The direct problem resulting from this is that such missing data is not accounted for, so errors may emerge when such CDL is read in.

Long-term solution

Ultimately we should:

  1. determine whether header- and coordinate-based CDL can be mapped conclusively to field constructs,
  2. a) if so, adapt the logic so that missing data is dealt with appropriately and does not emerge downstream as errors when reading such CDL, both here in standalone cfdm and in dependencies like cf-python, as for NCAS-CMS/cf-python#197;
    b) if not so, decide on the best approach for a result of reading in such CDL, whether that is to provide some sensible "best guess" field encapsulating the known information, or to tell users via appropriate error message that the CDL can't be read-in as fields for these reasons.

Short-term fix

For the forthcoming release, to address NCAS-CMS/cf-python#197, I am catching the MaskErrors to raise them as a user-friendly message stating that the CDL metadata is insufficient for conversion to field constructs, i.e. assuming case (b) and raising a ValueError. This is sufficient for the release but should be re-evaluated in the longer term.

Even if we end up going with this approach after the review, I would like to create a custom error class to raise for related errors, rather than using a Python built-in Exception.

Zero-sized unlimited dimension when read from a grouped netCDF file

The file cm4twc_dump_file.nc contains subgroups and has an unlimited dimension, currently of size 13. However, when read it gives the unlimited dimension as size 0:

>>> cfdm.read('cm4twc_dump_file.nc')[0]
<Field: transfer_i(time(0), altitude(1), latitude(4), longitude(3)) 1>

And downstream errors occur, e.g. when trying to subspace the size zero dimension.

This is a bug in cfdm.read (well cfdm.read_write.netcdf.NetCDFRead to be exact), which takes the dimension sizes from the flattened version of the file, but the flattened file does not know the unlimited dimension size, because the flattened file contains no arrays.

This is easily fixed by getting the dimension size from the original grouped file instead. PR to follow.
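For illustration, the unlimited dimension sizes can be recovered from the original grouped file with netCDF4 along these lines (a sketch only, not the actual patch):

import netCDF4

def unlimited_dimension_sizes(filename):
    # Walk every group of the original (un-flattened) file and record the
    # size of each unlimited dimension, keyed by its path.
    sizes = {}
    with netCDF4.Dataset(filename, 'r') as nc:
        groups = [nc]
        while groups:
            g = groups.pop()
            for name, dim in g.dimensions.items():
                if dim.isunlimited():
                    sizes[g.path.rstrip('/') + '/' + name] = dim.size
            groups.extend(g.groups.values())
    return sizes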

>>> cfdm.environment(paths=False)
Platform: Linux-5.4.0-62-generic-x86_64-with-debian-bullseye-sid
HDF5 library: 1.10.5
netcdf library: 4.6.3
Python: 3.7.0
netCDF4: 1.5.4
numpy: 1.18.4
cfdm.core: 1.8.8.0
cftime: 1.3.0
netcdf_flattener: 1.2.0
cfdm: 1.8.8.0

Accurate and doctest-able `cfdm.core` docstring examples

Most of the docstring code examples within cfdm.core modules suggest there will be cfdm-like user-friendly outputs, whereas running the code in fact produces the default Python object representation with the class name and CPython object id, like <cfdm.core.data.data.Data object at 0x7f80e9c7a3d0> or <cfdm.core.interiorring.InteriorRing object at 0x7f80e9636910>.

For example, compare:

>>> import cfdm.core
>>> d = cfdm.core.Data(range(10))
>>> c = cfdm.core.DimensionCoordinate()
>>> c.set_data(d)
>>> d
<cfdm.core.data.data.Data object at 0x7f833e063430>
>>> c
<cfdm.core.dimensioncoordinate.DimensionCoordinate object at 0x7f833cc90f70>
>>> c.get_data()
<cfdm.core.data.data.Data object at 0x7f833cc94130>

with the equivalent using the cfdm (rather than cfdm.core) classes:

>>> import cfdm
>>> d = cfdm.Data(range(10))
>>> c = cfdm.DimensionCoordinate()
>>> c.set_data(d)
>>> d
<Data(10): [0, ..., 9]>
>>> c
<DimensionCoordinate: (10) >
>>> c.get_data()
<Data(10): [0, ..., 9]>

where relevant docstring examples suggest the cfdm behaviour, e.g:

>>> d = {{package}}.Data(range(10))
>>> f.set_data(d)
>>> f.has_data()
True
>>> f.get_data()
<{{repr}}Data(10): [0, ..., 9]>
>>> f.del_data()
<{{repr}}Data(10): [0, ..., 9]>

This should be improved, as really we want accurate docstrings not just for (and distinguishing between) cfdm.core and cfdm, but also for cf-python, which inherits many docstrings for methods it does not overload. And ideally they will all be doctest-able to ensure validity.

(I encountered this whilst reviewing the examples with the aim of incrementally ensuring they are all appropriate and functionally sound via doctest, for which this issue has particular relevance.)

We noted this was because cfdm.core does not have __repr__ methods defined by design, those being left for definition in cfdm.

Possible solutions

We agreed there are at least two potential solutions, namely:

  1. Move the __repr__ methods defined in the cfdm modules to cfdm.core, so the only difference between docstrings in cfdm.core, cfdm and cf for them to work in all cases is the package name, which is handled by {{package}} docstring substitutions.
  2. Change the cfdm.core docstrings to show the true outputs, i.e. the default Python object representation. That would be complicated by the fact that those representations include a memory address which will usually change, but we could probably just replace those with an ellipsis, as both a user-facing and understandable marker and a means recognised by doctest for ignoring certain text and assuming whatever lies there is acceptable (see the example after this list).
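For option 2, the memory address could be handled with doctest's ELLIPSIS directive, e.g. (an illustrative docstring example, not one currently in the codebase):

>>> import cfdm.core
>>> cfdm.core.Data(range(10))  # doctest: +ELLIPSIS
<cfdm.core.data.data.Data object at 0x...>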

Reconsider treatment of external cell measures

We want to evaluate the meaning and desired behaviour of external cell measures within the CF data model, for example towards a consistent and rational approach to upstream aggregation in cf-python.

Can not write to 'NETCDF3_64BIT_OFFSET' and 'NETCDF3_64BIT_DATA' format files

Can not write to 'NETCDF3_64BIT_OFFSET' and 'NETCDF3_64BIT_DATA' format files:

In [1]: import cfdm                                                                                      

In [2]: cfdm.environment()                                                                               
Platform: Linux-4.15.0-72-generic-x86_64-with-debian-stretch-sid
python: 3.7.3
future: 0.17.1 
HDF5 library: 1.10.2
netcdf library: 4.6.1
netCDF4: 1.4.2 
numpy: 1.16.2
cfdm: 1.7.11

In [4]: f = cfdm.read('cfdm/test/test_file.nc')                                                          

In [6]: cfdm.write(f, 'out.nc', fmt='NETCDF3_64BIT_OFFSET')
<snip>
ValueError: Unknown output file format: NETCDF3_64BIT_OFFSET

In [7]: cfdm.write(f, 'out.nc', fmt='NETCDF3_64BIT_DATA')                                                
<snip>
ValueError: Unknown output file format: NETCDF3_64BIT_DATA

This should be possible, as per the documentation in cfdm.write.

Further documentation on creating fields and writing fields to file

As stated in an email thread:

An overview of creation and writing near the top of the tutorial (where there is already a read overview) would be beneficial.

This is in response to a(n expert) user request:

a couple of tutorials on "how to create CF compliant data" would be really useful. We could tie them to useful examples, like: creating a multimodel ensemble dataset and "converting a grib file to cf compliant netcdf"

Query: setting the variable name for a grid_mapping

Hello David,

the script you gave me works fine now, and I've been able to modify it to adjust the names of variables in the output netcdf file by using either nc_set_variable or nc_set_dimension, with one exception: there is a grid_mapping variable which is generated with the name rotated_latitude_longitude and carries the datum: I can't find a construct which corresponds to this variable. Is there a way of changing the name-in-file of the grid_mapping variable?

Tests: sed setup lines invalid on Mac OS

As detected by the "Run test suite" GH Actions workflow for the latest commit, ce9a5a4, a sub-test fails on Mac OS, but not on Linux (Ubuntu), at the setup stage on an in-place sed command:

======================================================================
ERROR: test_read_CDL (test_read_write.read_writeTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/runner/runners/2.169.0/work/cfdm/cfdm/cfdm/test/test_read_write.py", line 216, in test_read_CDL
    shell=True, check=True)
  File "/Users/runner/miniconda3/envs/cfdm-latest/lib/python3.7/subprocess.py", line 512, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command 'sed -i "1 i\ \ " /var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/tmp27bjmfaj.cfdm_test' returned non-zero exit status 1.

This post explains that Mac OS treats in-place sed commands slightly differently, requiring an explicit (possibly empty) backup extension to be specified with -i.

It will be a quick fix that I will put in now, but I'm raising an intermediate GH Issue as a note since it is likely we could run into this setup issue again.

Docs: some methods do not appear in API reference

Some methods are missing from the API reference, such that when (some of them) are cited elsewhere in the documentation they link not to the internal methods as intended but to the equivalently-named methods in the Python documentation, presumably via intersphinx. (If other missing methods did not share a name with a built-in, they would simply not get linked, which is not misleading but still not ideal.)

I have noticed this at least for the max and min methods listed under the "see also" directive for certain classes, e.g. for Data.sum. This is despite Data.max being defined in the codebase and working just fine on a field's data.

Note I've looked into this briefly and I can see it's not an autodoc extension scoping issue since we set the module correctly in the templates, via (for the Data.sum case) the .. currentmodule:: cfdm in the method.rst template.

So we need to ensure all objects in the reference have all possible methods listed under one of the autosummary lists for their class under the class/ dir. I will do that now for max and min, since I have spotted them, but at release-time we should find some means to check that all non-private methods in the codebase are cited in an autosummary and hence are generated when the docs are built. Ideally we can find a Sphinx tool that can check that for us, else write a small script to check.

Docs: drop-down menu for same-page version switching

We've noted we'd like a collapsible drop-down menu (in the sidebar, for example) for selecting and changing the version of the documentation being shown, where equivalent pages are mapped across versions, rather than pages at specific versions needing to be accessed via their index pages as a starting point.

Such a menu is provided for any docs hosted with ReadTheDocs (at least covering limited aliased versions e.g. 'latest', 'stable'), but for self-hosted docs not using the theme sphinx_rtd_theme, a bit of manual configuration & templating seems to be required. It's definitely possible without too much difficulty or code, as I have seen from some examples (see some listed below), but it is not easy to trace the parts of the docs source and config that result in the versioning in each case.

Note that the completion of this can and should go hand-in-hand with addressing #28 (once a structure to process versions is in, it becomes trivial to add some new text to all pages of a certain version, by templating, and the versioning extensions below provide this as a configuration option) and since it relates to documentation customisation, it would be good to tackle #50 simultaneously also.

References

Some helpful resources I've found after a little investigation:

  • The pytest docs: see menu at bottom right which covers about 100 versions (note the theme looks like alabaster but is in fact a different one, with version menu processing seemingly coded up here).
  • A useful comment with code snippets illustrating how they have achieved something similar (for a different theme, but probably could work for alabaster with or without some modification): pydata/pydata-sphinx-theme#23 (comment)
  • some Sphinx extensions that might be able to help, but might be excessive depending on what can achieve the basic goal:

NetCDF string length dimension name methods

We noted it would be useful, particularly after discussion arising from NCAS-CMS/cf-python#69, to have methods that can get, set, delete and check for the existence of trailing string length dimensions.

With naming in line with methods in the existing API, the intuitive case would be for a set of four methods named nc_{get, set, del, has}_string_length_dimension.
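A rough sketch of the shape such methods could take, following the get/set/del/has pattern of the existing netCDF naming methods (the attribute-based storage here is an assumption for illustration, not the real cfdm mixin internals):

class NetCDFStringLengthDimension:
    # Hypothetical mixin illustrating only the proposed interface.

    def nc_set_string_length_dimension(self, value):
        self._string_length_dimension = value

    def nc_get_string_length_dimension(self, default=AttributeError()):
        try:
            return self._string_length_dimension
        except AttributeError:
            if isinstance(default, Exception):
                raise default
            return default

    def nc_has_string_length_dimension(self):
        return hasattr(self, '_string_length_dimension')

    def nc_del_string_length_dimension(self, default=AttributeError()):
        value = self.nc_get_string_length_dimension(default)
        if hasattr(self, '_string_length_dimension'):
            del self._string_length_dimension
        return value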

Decorators enabling in-place operations

As already implemented in cf-python: NCAS-CMS/cf-python@b82f507 & the consecutive commits up to NCAS-CMS/cf-python@c45b7d9, plus NCAS-CMS/cf-python@441a2d6.

In this case, we might have to be careful to check that the downstream behaviour in cf-python is unaffected: methods using the in-place decorator, inherited through subclassing by cf-python & then used inside its own similarly-decorated methods, could do something fatal like recurse, leading to RecursionError: maximum recursion depth exceeded, as I observed frequently during development of the equivalents in that library when I hadn't implemented the super() cases correctly.
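For reference, a minimal sketch of the kind of decorator involved (simplified relative to the cf-python/cfdm implementations, which use helper functions for the in-place bookkeeping):

import functools

def inplace_enabled(method):
    # When inplace=True, operate on self and return None; otherwise
    # operate on (and return) a copy, leaving self untouched.
    @functools.wraps(method)
    def wrapper(self, *args, inplace=False, **kwargs):
        target = self if inplace else self.copy()
        method(target, *args, **kwargs)
        return None if inplace else target
    return wrapper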

Improve cfdm.read performance

Profiling cfdm.read suggests that significant performance improvements can be found by removing calls to pprint.pformat and eliminating unnecessary deep copies. Other optimisations (e.g. f-strings) can also be applied.

Reading the file test_file.nc that is produced by the test suite gives, with the new code, a speed-up of 40% (from 0.08 s to 0.048 s).
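For instance, an expensive pformat call can be guarded so it is only evaluated when the message will actually be emitted (a sketch of the pattern, not the exact change made):

import logging
from pprint import pformat

logger = logging.getLogger(__name__)
global_attributes = {'Conventions': 'CF-1.8', 'title': 'example'}

if logger.isEnabledFor(logging.DEBUG):
    # Only pay for pformat when DEBUG output is switched on
    logger.debug('Global attributes:\n' + pformat(global_attributes, indent=4))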

Docs: out-of-date warning on older versions

The first result I see upon a Google search for "cfdm documentation" links to an old version (1.7.1) of the documentation. However, there is no indication that it is not the latest & greatest.

Since it is likely, as this illustrates, that users may end up viewing older versions of the documentation inadvertently, we should add a visible & explicit warning that the pages & table of contents on display belong to an old version. Other libraries often do this, for example, NumPy displays:

[screenshot: NumPy's warning banner on an old documentation version]

& similarly Python displays:

[screenshot: Python's warning banner on an old documentation version]

The simplest way to do this would be to inject an RST warning directive (.. warning::) at the top of all content pages for older versions, as that directive provides a ready-made red text box, which seems to be the UI design trend for this (as above) & which will draw the necessary attention.
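A sketch of how the injection could be scripted (the paths and wording are assumptions; the real archived-docs layout may differ):

from pathlib import Path

WARNING = (
    '.. warning::\n\n'
    '   You are viewing the documentation for an old version of cfdm.\n'
    '   The latest documentation is at https://ncas-cms.github.io/cfdm\n\n'
)

def prepend_warning(version_docs_dir):
    # Prepend the directive to every content page of an archived version.
    for page in Path(version_docs_dir).rglob('*.rst'):
        text = page.read_text()
        if not text.startswith('.. warning::'):
            page.write_text(WARNING + text)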

Read fails when a coordinate bounds variable is missing in the file

If a (scalar or auxiliary) coordinate variable has bounds referenced from it (typically with the bounds netCDF attribute) but the referenced bounds variable is not in the file, then a KeyError occurs. What should happen is that the non-compliance is logged and the read continues, creating a coordinate construct without bounds.

Support for CDL

It would be really useful to be able to read CDL files directly into cfdm, rather than having to first convert to binary netCDF files. Can this be added?

HTML table construct inspection in Jupyter Notebooks

IPython supports 'rich' display within Jupyter Notebooks (or see here for a great blog post about it), such that we could implement a _repr_html_ method in appropriate classes to output a real HTML table rather than the 'makeshift' tables we are constrained to returning in standard interpreter scenarios.

In particular, this would be beneficial to implement for any non-minimal-detail inspection call with a construct, e.g. for a field print(f) & f.dump(), as they can output a lot of information & we want it to be as easy as possible for users to pick out what they are interested in.

As well as the obvious separation of components in the output, with HTML tables you get basic cell shading & lines & bold text to make the output easier to digest. If we really wanted to push the boat out, we could even implement something more sophisticated to make rows or groups of them collapsible, as per the xarray example in the blog post linked above.

Demo

As a demonstration, I've coded up a basic tabular output for the minimal-detail inspection of a field (i.e. repr -> _repr_html_ for the field in notebooks). I used it simply to get a basic example to show, and note I think a table is overkill for this context in practice; really I want to tabularise the str and dump representations similarly. The result (Out[3]):

[screenshot: example HTML table rendering of a field in a notebook]

is produced by this example method inside the Field class:

def _repr_html_(self):
    """
    Outputs a HTML table representation within Jupyter notebooks.
    """
    # HTML tags to use to compose the table in HTML
    blank_table = '<table style="width:50%">{}</table>'
    blank_row_container = "<tr>{}</tr>"
    heading_row_content = "<th colspan='{}'>{}</th>"
    data_row_content = "<td>{}</td>"

    # Extract some info as processed otherwise into one_line_description
    x = [self._unique_domain_axis_identities()[axis] for axis in
         self.get_data_axes(default=())]
    axes_rows = [data_row_content.format(data) for data in x]

    # Construct and populate table
    type_of_construct = heading_row_content.format(
        1, str(self.__class__.__name__) + ":")
    identity_info = heading_row_content.format(
        len(axes_rows) - 1,
        "{} (units of {})".format(
            self.identity(''),
            self.get_property('units', None)
        )
    )
    heading_row = blank_row_container.format(
        type_of_construct + identity_info)

    return blank_table.format(heading_row + "".join(axes_rows))

Decisions to make

If we think this is a good idea, we should consider:

  • whether it is best to put the relevant methods here in cfdm, or in cf-python;
  • which inspection cases to implement a _repr_html_ for;
  • what format we want produced table outputs to be in each case (I think it best to develop a mock-up before coding any method up).

Remove Python 2.7 support

As of version 1.8.5, cfdm no longer works with Python 2.7, due to API changes in the logging package (#35).

The implementation of netCDF groups (#13) will require the import of the netcdf-flattener library, which is Python 3 only.

Therefore Python 2.7 support will be formally withdrawn at the next release: 1.8.6

Dictionary view objects from methods to get constructs &/or keys

The docstring description for the items, keys and values methods for Constructs instances implies they return dicts or lists, as they would have in Python 2, instead of the dict-like and list-like objects dict_items, dict_keys or dict_values that are returned in Python 3:

def items(self):
    '''Return the items as (construct key, construct) pairs.

    .. versionadded:: (cfdm) 1.7.0

    .. seealso:: `get`, `keys`, `values`

    '''
    return self._dictionary().items()

def keys(self):
    '''Return all of the construct keys, in arbitrary order.

    .. versionadded:: (cfdm) 1.7.0

    .. seealso:: `get`, `items`, `values`

    '''
    return self._construct_type.keys()

def values(self):
    '''Return all of the metadata constructs, in arbitrary order.

    .. versionadded:: (cfdm) 1.7.0

    .. seealso:: `get`, `items`, `keys`

    '''
    return self._dictionary().values()

for example:

>>> a = cfdm.example_field(0)
>>> a.constructs.items()
dict_items([('dimensioncoordinate0', <DimensionCoordinate: latitude(5) degrees_north>), ('dimensioncoordinate1', <DimensionCoordinate: longitude(8) degrees_east>), ('dimensioncoordinate2', <DimensionCoordinate: time(1) days since 2018-12-01 >), ('domainaxis0', <DomainAxis: size(5)>), ('domainaxis1', <DomainAxis: size(8)>), ('domainaxis2', <DomainAxis: size(1)>), ('cellmethod0', <CellMethod: area: mean>)])
>>> a.constructs.keys()
dict_keys(['domainaxis0', 'domainaxis1', 'domainaxis2', 'dimensioncoordinate0', 'dimensioncoordinate1', 'dimensioncoordinate2', 'cellmethod0'])
>>> a.constructs.values()
dict_values([<DimensionCoordinate: latitude(5) degrees_north>, <DimensionCoordinate: longitude(8) degrees_east>, <DimensionCoordinate: time(1) days since 2018-12-01 >, <DomainAxis: size(5)>, <DomainAxis: size(8)>, <DomainAxis: size(1)>, <CellMethod: area: mean>])

Given this is very likely a change resulting from the Python 2 to 3 port, I want to raise this to check what we want to return before fleshing out the return type in the documentation for further clarity. Do we want to return these view objects, or dicts or lists as applicable by converting via dict() or list() before returning?

I think the view objects are preferable because they are lightweight iterable views and so more efficient for iteration, which seems to be by far the most common use case. But I see we also have an __iter__ method available to use if iterators are required, so maybe we do want to return the standard Python structures instead(?):

def __iter__(self):
    '''Called when an iterator is required.

    x.__iter__() <==> iter(x)

    .. versionadded:: (cfdm) 1.7.0

    '''
    return iter(self._dictionary().keys())

Unless there is a requirement to return the alternative, I suggest we stick with returning the view objects but in the docstrings clarify that dict (list, as appropriate) can be called to convert from the view objects if a dict (list) is strictly required.

Missing values not being read correctly after being written

>>> f.data.array
[[[-- -- -- -- -- -- -- -- --]
  [9.0 10.0 11.0 12.0 13.0 14.0 15.0 16.0 17.0]]]
>>> cfdm.write(f, 'file.nc')
>>> g = cfdm.read('file.nc')[0]
>>> g.data.array
[[[9.96920997e+36 9.96920997e+36 9.96920997e+36 9.96920997e+36
   9.96920997e+36 9.96920997e+36 9.96920997e+36 9.96920997e+36
   9.96920997e+36]
  [9.00000000e+00 1.00000000e+01 1.10000000e+01 1.20000000e+01
   1.30000000e+01 1.40000000e+01 1.50000000e+01 1.60000000e+01
   1.70000000e+01]]]

This only happens for netCDF4 formats. NetCDF3 formats are OK.
