An interface to metadata conventions for geospatial data

License: GNU General Public License v3.0


py-thesaurus-interface's Introduction


py-thesaurus-interface

An interface to metadata vocabularies for geospatial and other geophysical data

Install

pip install https://github.com/nansencenter/py-thesaurus-interface/archive/master.tar.gz

Usage

import pythesint as pti
pti.get_gcmd_instrument('MERIS')
pti.get_gcmd_platform('ENVISAT')
pti.get_gcmd_provider('NERSC')

For JSON vocabularies, it is possible to update the local files using a specific version. The format of the version depends on the provider. If no version is provided, the latest version is retrieved.

import pythesint as pti
pti.update_gcmd_instrument(version='9.1.5')
pti.update_cf_standard_name(version='20210119T185509')
pti.update_mmd_platform_type(version='a5c8573')

or

import pythesint as pti
pti.update_all_vocabularies(versions={
    'gcmd_instrument': '9.1.5',
    'cf_standard_name': '20210119T185509',
    'mmd_platform_type': 'a5c8573'
})

Standards

The package follows the standards defined by NASA's Global Change Master Directory (GCMD) (http://gcmd.gsfc.nasa.gov) and the NetCDF-CF conventions (http://cfconventions.org/); others may be added as needs emerge. The standards are mapped to Python dictionaries and saved as JSON files.
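
As a rough illustration of how a vocabulary entry can be held as a dictionary and stored in a JSON file, the sketch below round-trips one entry through disk. The field names and values are borrowed from the test fixture quoted in the issues further down; the file name and both helper functions are hypothetical and are not part of pythesint's API.

```python
import json
import os
import tempfile
from collections import OrderedDict


def save_vocabulary(entries, path):
    """Write a list of keyword dictionaries to a JSON file (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(entries, handle, indent=2)


def load_vocabulary(path):
    """Read the entries back, preserving the field order of each entry."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle, object_pairs_hook=OrderedDict)


# Illustrative entry only; real GCMD or CF entries have different fields.
entries = [OrderedDict([("Type", "Cat"), ("Category", "Animal"), ("Name", "")])]

path = os.path.join(tempfile.mkdtemp(), "example_vocabulary.json")
save_vocabulary(entries, path)
loaded = load_vocabulary(path)
```

Loading with `object_pairs_hook=OrderedDict` keeps the fields in the order they were written, which matters when the field order encodes a hierarchy.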

Directory Interchange Format (DIF)

The DIF is a descriptive, standardized format for exchanging information about scientific data sets. The py-thesaurus-interface package provides an interface to the keywords and formats defined by GCMD.

See: Directory Interchange Format (DIF) Writer's Guide, 2015. Global Change Master Directory. National Aeronautics and Space Administration. [http://gcmd.nasa.gov/add/difguide/].

Controlled keyword vocabularies from GCMD

See: Global Change Master Directory (GCMD). 2015. GCMD Keywords, Version 8.1. Greenbelt, MD: Global Change Data Center, Science and Exploration Directorate, Goddard Space Flight Center (GSFC) National Aeronautics and Space Administration (NASA). URL:http://gcmd.nasa.gov/learn/keywords.html

py-thesaurus-interface's People

Contributors

akorosov, aleksandervines, aperrin66, azamifard, chhorvat, mortenwh

py-thesaurus-interface's Issues

CF specific code should not be in Vocabulary.find_keyword

Regarding this code:

        # OBS: This works for the gcmd keywords but makes no sense for the cf
        # standard names - therefore always search the cf standard names by
        # standard_name only..
        for m in matches:
            remaining = {}
            for i in ii:
                remaining[keys[i]] = m[keys[i]]
            if not any(val for val in remaining.itervalues()):
                return m

cf_vocabulary should rather override the find_keyword method, or the find_keyword method should be made more generic and then overridden.

I vote for generic and overridden. Consider the following scenario: what if there are multiple matches, e.g. when I search for "Imaging Radars"? Then it should return all matches (currently it returns nothing, which I think is a bug).

If it returned a list of all matches, cf_vocabulary could override it by calling find_keyword from the superclass and restricting the search to standard_name.

BUT: why are we only searching standard names? I think there should be an option here. Perhaps I don't know which standard name to use and want to search the description as well.
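
A minimal sketch of the "generic and overridden" idea, under stated assumptions: the classes below are simplified stand-ins rather than pythesint's real Vocabulary classes, the method name find_keywords and its fields parameter are hypothetical, and every entry is made up for illustration.

```python
class Vocabulary:
    """Simplified, hypothetical stand-in for pythesint's Vocabulary class."""

    def __init__(self, entries):
        self.entries = entries  # list of dicts, one per keyword

    def find_keywords(self, item, fields=None):
        """Return every entry in which one of `fields` equals `item`.

        The comparison is case-insensitive. With fields=None, all fields
        of each entry are searched.
        """
        wanted = item.upper()
        matches = []
        for entry in self.entries:
            keys = entry.keys() if fields is None else fields
            if any(str(entry.get(key, "")).upper() == wanted for key in keys):
                matches.append(entry)
        return matches


class CFVocabulary(Vocabulary):
    """Searches standard_name by default, but lets the caller widen the search."""

    def find_keywords(self, item, fields=("standard_name",)):
        return super().find_keywords(item, fields=fields)


# Made-up entries, for illustration only.
instruments = Vocabulary([
    {"Class": "Imaging Radars", "Short_Name": "SAR-A"},
    {"Class": "Imaging Radars", "Short_Name": "SAR-B"},
    {"Class": "Altimeters", "Short_Name": "ALT-A"},
])
radar_matches = instruments.find_keywords("imaging radars")

cf = CFVocabulary([
    {"standard_name": "example_a", "description": "example_b"},
    {"standard_name": "example_b", "description": "another description"},
])
by_name = cf.find_keywords("example_b")
by_any_field = cf.find_keywords("example_b", fields=("standard_name", "description"))
```

Returning a list keeps the base class free of CF-specific logic, and the subclass only changes the default set of fields that is searched.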

Changes in GCMD platform data structure

The fields returned by the GCMD webservice for platforms have changed for all versions of the vocabulary.

It went from: Category, Series_Entity, Short_Name, Long_Name
to: Basis, Category, Sub_Category, Short_Name, Long_Name

As a result, the get_ and search_ functions no longer work for platforms, no matter which vocabulary version is specified.
We need to check all other vocabularies, adapt the code to work with the new fields, and decide whether the interface for platforms should change. Do we keep returning objects with the old fields, ensuring compatibility with existing software, or do we use the new fields, in which case all software using pythesint needs to be updated?

can't install on ubuntu (/usr/bin/python)

When I do the command:
pip install https://github.com/nansencenter/py-thesaurus-interface/archive/master.tar.gz

I get the error:
Traceback (most recent call last):
File "", line 1, in
File "/tmp/pip-t97wYu-build/setup.py", line 84, in
cmdclass = {'install_scripts': update_vocabularies}
File "/usr/lib/python2.7/distutils/core.py", line 111, in setup
_setup_distribution = dist = klass(attrs)
File "/usr/lib/python2.7/dist-packages/setuptools/dist.py", line 269, in init
self.fetch_build_eggs(attrs['setup_requires'])
File "/usr/lib/python2.7/dist-packages/setuptools/dist.py", line 313, in fetch_build_eggs
replace_conflicting=True,
File "/usr/lib/python2.7/dist-packages/pkg_resources/init.py", line 826, in resolve
dist = best[req.key] = env.best_match(req, ws, installer)
File "/usr/lib/python2.7/dist-packages/pkg_resources/init.py", line 1092, in best_match
return self.obtain(req, installer)
File "/usr/lib/python2.7/dist-packages/pkg_resources/init.py", line 1104, in obtain
return installer(requirement)
File "/usr/lib/python2.7/dist-packages/setuptools/dist.py", line 380, in fetch_build_egg
return cmd.easy_install(req)
File "/usr/lib/python2.7/dist-packages/setuptools/command/easy_install.py", line 663, in easy_install
return self.install_item(spec, dist.location, tmpdir, deps)
File "/usr/lib/python2.7/dist-packages/setuptools/command/easy_install.py", line 693, in install_item
dists = self.install_eggs(spec, download, tmpdir)
File "/usr/lib/python2.7/dist-packages/setuptools/command/easy_install.py", line 873, in install_eggs
return self.build_and_install(setup_script, setup_base)
File "/usr/lib/python2.7/dist-packages/setuptools/command/easy_install.py", line 1101, in build_and_install
self.run_setup(setup_script, setup_base, args)
File "/usr/lib/python2.7/dist-packages/setuptools/command/easy_install.py", line 1087, in run_setup
run_setup(setup_script, args)
File "/usr/lib/python2.7/dist-packages/setuptools/sandbox.py", line 246, in run_setup
raise
File "/usr/lib/python2.7/contextlib.py", line 35, in exit
self.gen.throw(type, value, traceback)
File "/usr/lib/python2.7/dist-packages/setuptools/sandbox.py", line 195, in setup_context
yield
File "/usr/lib/python2.7/contextlib.py", line 35, in exit
self.gen.throw(type, value, traceback)
File "/usr/lib/python2.7/dist-packages/setuptools/sandbox.py", line 166, in save_modules
saved_exc.resume()
File "/usr/lib/python2.7/dist-packages/setuptools/sandbox.py", line 141, in resume
six.reraise(type, exc, self._tb)
File "/usr/lib/python2.7/dist-packages/setuptools/sandbox.py", line 154, in save_modules
yield saved
File "/usr/lib/python2.7/dist-packages/setuptools/sandbox.py", line 195, in setup_context
yield
File "/usr/lib/python2.7/dist-packages/setuptools/sandbox.py", line 243, in run_setup
DirectorySandbox(setup_dir).run(runner)
File "/usr/lib/python2.7/dist-packages/setuptools/sandbox.py", line 273, in run
return func()
File "/usr/lib/python2.7/dist-packages/setuptools/sandbox.py", line 242, in runner
_execfile(setup_script, ns)
File "/usr/lib/python2.7/dist-packages/setuptools/sandbox.py", line 46, in _execfile
exec(code, globals, locals)
File "/tmp/easy_install-12gS2d/xdg-2.0.0/setup.py", line 19, in
REQS = ['PyYAML', 'requests', 'xdg;platform_system!="Windows"']
File "/tmp/easy_install-12gS2d/xdg-2.0.0/setup.py", line 12, in read_long_description
from setuptools import setup, find_packages
AttributeError: 'PosixPath' object has no attribute 'read_text'

Take config out of package

Or at least add an option to keep it in a user-controlled location.

E.g. on Unix/Linux, a .pythesintrc.yaml file in the home directory could be checked before loading the default config from resource_string(__name__, 'pythesintrc.yaml').
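
The lookup order proposed here could be sketched as follows. This is an assumption-laden outline, not pythesint code: the function name is hypothetical, the packaged default is passed in as plain text instead of being read through resource_string, and parsing the YAML is left out.

```python
from pathlib import Path

USER_CONFIG_NAME = ".pythesintrc.yaml"  # file name proposed in this issue


def read_config_text(default_text, home=None):
    """Return the user's config text if the file exists, else the default.

    `default_text` stands in for the packaged pythesintrc.yaml content;
    `home` can be overridden, which is mainly useful for testing.
    """
    home_dir = Path(home) if home is not None else Path.home()
    user_config = home_dir / USER_CONFIG_NAME
    if user_config.is_file():
        return user_config.read_text(encoding="utf-8")
    return default_text
```

A caller would pass the packaged default and then parse whichever text comes back, so the rest of the configuration handling stays unchanged.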

Conda package for Python 3.7

Vagrant VMs cannot install Python 3.7 because there is no pythesint conda package for that version.

To do:

  • Make Conda package with Python 3.7

Link http://gcmd.gsfc.nasa.gov appears to be dead in README

In file README.md

On line 43, the link http://gcmd.gsfc.nasa.gov appears to be dead.
The Web Archive shows the last capture on 29 March 2022, redirecting to https://web.archive.org/web/20220329102926/https://idn.ceos.org/; the site https://idn.ceos.org/ does appear to be functional.

A DuckDuckGo search for the link returns top four:

Issue with Python 3.5

Something is not working on Python 3.5. One test fails on all platforms (but it passes on both Python 2.7 and 3.6):

.............................F...
======================================================================
FAIL: test_find_keyword (pythesint.tests.test_vocabulary.VocabularyTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/vagrant/shared/pythesint_vm/py-thesaurus-interface/pythesint/tests/test_vocabulary.py", line 52, in test_find_keyword
    self.assertEqual(vocab.find_keyword('Animal'), self.animal)
AssertionError: OrderedDict([('Type', 'Cat'), ('Category', 'Animal'), ('Name', '')]) != OrderedDict([('Type', ''), ('Category', 'Animal'), ('Name', '')])

----------------------------------------------------------------------
Ran 33 tests in 44.271s

Running update_all_vocabularies() on package install can be problematic

Right now, update_all_vocabularies() is run every time the package is installed.
If something goes wrong during the execution, the installation of the package fails.

This kind of failure is often caused by a change on the source of one of the vocabularies, which can be solved quite easily by changing the version of the source that is used. But if the installation fails, it is impossible to try that.

There are several possible solutions:

  • catch exceptions in the installation script to avoid failing (not a good solution in my opinion)
  • make it possible to disable the update on install (using an environment variable or a CLI argument)
  • call update_all_vocabularies() on import instead of on install (only if no local files are present, to avoid calling it every time)

What do you think @akorosov?
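
The second and third options could be combined in a small decision helper, sketched below. The environment variable name, the function name, and the assumption that local vocabularies are *.json files in a single directory are all hypothetical; none of them is confirmed by the package.

```python
import os
from pathlib import Path

# Hypothetical variable name; nothing in the text above defines it.
SKIP_UPDATE_ENV_VAR = "PYTHESINT_SKIP_UPDATE"


def should_update(data_dir, environ=None):
    """Decide whether the vocabularies should be downloaded.

    Returns False when the user opted out through the environment variable,
    or when JSON files are already present in `data_dir`.
    """
    environ = os.environ if environ is None else environ
    opted_out = environ.get(SKIP_UPDATE_ENV_VAR, "").strip().lower()
    if opted_out in ("1", "true", "yes"):
        return False
    data_dir = Path(data_dir)
    has_local_files = data_dir.is_dir() and any(data_dir.glob("*.json"))
    return not has_local_files
```

An import-time hook could then call update_all_vocabularies() only when should_update(...) returns True, so a failing source never blocks installation.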

Windows issue

There seems to be a Python 3 issue with byte strings in the response data.

Cannot get project keywords in Python 3

I have a problem getting project keywords from pythesint with Python 3.6 (nansat VM). It works fine with Python 2, and other keywords (platforms, instruments) work with Python 3.

>>> import pythesint as pti
>>> pti.get_gcmd_project('GHRSST')

Traceback (most recent call last):
  File "/home/vagrant/anaconda/envs/py3nansat/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2961, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-8-29c8ed8ce8a4>", line 1, in <module>
    pti.get_gcmd_project('GHRSST')
  File "/home/vagrant/anaconda/envs/py3nansat/lib/python3.6/site-packages/pythesint/vocabulary.py", line 45, in find_keyword
    for d in self.get_list():
  File "/home/vagrant/anaconda/envs/py3nansat/lib/python3.6/site-packages/pythesint/json_vocabulary.py", line 17, in get_list
    return self.sort_list(json.load(open(self.get_filepath())))
  File "/home/vagrant/anaconda/envs/py3nansat/lib/python3.6/json/__init__.py", line 299, in load
    parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
  File "/home/vagrant/anaconda/envs/py3nansat/lib/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/home/vagrant/anaconda/envs/py3nansat/lib/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/home/vagrant/anaconda/envs/py3nansat/lib/python3.6/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
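
"Expecting value: line 1 column 1 (char 0)" is what the json module reports when the input is empty or does not start with a JSON value, so one plausible but unconfirmed cause is an empty or non-JSON local file for the project vocabulary. The sketch below shows how a loader could turn that case into a more readable error; the function name is hypothetical and this is not the code in json_vocabulary.py.

```python
import json


def load_json_or_explain(path):
    """Load a JSON file, raising a clearer error if it is empty or malformed."""
    with open(path, encoding="utf-8") as handle:
        text = handle.read()
    if not text.strip():
        raise ValueError("{} is empty; try updating the vocabulary again".format(path))
    try:
        return json.loads(text)
    except json.JSONDecodeError as error:
        raise ValueError("{} is not valid JSON: {}".format(path, error)) from error
```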

can't install on ubuntu (miniconda/conda-forge)

When I do the command
pip install https://github.com/nansencenter/py-thesaurus-interface/archive/master.tar.gz

I get the error:
Traceback (most recent call last):
File "", line 1, in
File "/tmp/pip-MVU4Zj-build/setup.py", line 84, in
cmdclass = {'install_scripts': update_vocabularies}
File "/home/timill/Packages/python/miniconda2-conda-forge/lib/python2.7/site-packages/setuptools/init.py", line 129, in setup
return distutils.core.setup(**attrs)
File "/home/timill/Packages/python/miniconda2-conda-forge/lib/python2.7/distutils/core.py", line 151, in setup
dist.run_commands()
File "/home/timill/Packages/python/miniconda2-conda-forge/lib/python2.7/distutils/dist.py", line 953, in run_commands
self.run_command(cmd)
File "/home/timill/Packages/python/miniconda2-conda-forge/lib/python2.7/distutils/dist.py", line 972, in run_command
cmd_obj.run()
File "/home/timill/Packages/python/miniconda2-conda-forge/lib/python2.7/site-packages/setuptools/command/install.py", line 61, in run
return orig.install.run(self)
File "/home/timill/Packages/python/miniconda2-conda-forge/lib/python2.7/distutils/command/install.py", line 575, in run
self.run_command(cmd_name)
File "/home/timill/Packages/python/miniconda2-conda-forge/lib/python2.7/distutils/cmd.py", line 326, in run_command
self.distribution.run_command(command)
File "/home/timill/Packages/python/miniconda2-conda-forge/lib/python2.7/distutils/dist.py", line 972, in run_command
cmd_obj.run()
File "/tmp/pip-MVU4Zj-build/setup.py", line 31, in run
import pythesint as pti
File "pythesint/init.py", line 2, in
from pythesint.pythesint import update_all_vocabularies
File "pythesint/pythesint.py", line 39, in
_process_config()
File "pythesint/pythesint.py", line 24, in _process_config
fromlist=[name])
File "pythesint/wkv_vocabulary.py", line 6, in
from pythesint.json_vocabulary import JSONVocabulary
File "pythesint/json_vocabulary.py", line 7, in
from pythesint.pathsolver import DATA_HOME
File "pythesint/pathsolver.py", line 5, in
from xdg import (XDG_DATA_HOME)
File "/home/timill/Packages/python/miniconda2-conda-forge/lib/python2.7/site-packages/xdg.py", line 40
def _getenv(variable: str, default: str) -> str:
^
SyntaxError: invalid syntax
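
The SyntaxError points at a function annotation (variable: str), which is Python 3 syntax that Python 2.7 cannot parse, so the installed xdg release cannot be imported there. A minimal sketch of one possible workaround, assuming only the data directory is needed: compute it directly from the XDG Base Directory rules instead of importing xdg. The function name is hypothetical and this is not pythesint's actual pathsolver code.

```python
import os


def xdg_data_home(environ=None):
    """Return the XDG data directory without depending on the xdg package.

    Follows the XDG Base Directory rules: use $XDG_DATA_HOME when it is set
    to an absolute path, otherwise fall back to ~/.local/share.
    """
    environ = os.environ if environ is None else environ
    value = environ.get("XDG_DATA_HOME", "")
    if value and os.path.isabs(value):
        return value
    return os.path.join(os.path.expanduser("~"), ".local", "share")
```

The function avoids annotations and other Python 3 only syntax, so the same source would parse on both interpreter lines.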

Replace badge in readme

The badge that shows the build status uses Travis. It should be removed or replaced with an equivalent from GitHub.

GCMD keywords versions are not pinned

Every time vocabularies are updated locally, the latest version from GCMD keywords is pulled.

This might be undesirable, especially when there are updates to existing keywords (it was the case recently for Sentinel-1: "SENTINEL-1A" became "Sentinel-1A" and the same for 1B).
This requires updating existing django-geo-spaas databases, for example.

It would be worthwhile to explore the GCMD API a bit and see whether the keyword version can be pinned in pythesintrc.yaml.
That way, fixed keyword versions would be associated with a pythesint release, increasing the reliability of the package.
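
One way the pinning could look, sketched under assumptions: the per-vocabulary "version" key and the shape of the parsed configuration are hypothetical, while the versions dictionary matches the update_all_vocabularies(versions=...) call shown in the README. The version strings are the ones from the README example.

```python
def collect_pinned_versions(config):
    """Map vocabulary names to their pinned versions, skipping unpinned ones."""
    return {
        name: str(settings["version"])
        for name, settings in config.items()
        if settings.get("version") is not None
    }


# Hypothetical parsed content of pythesintrc.yaml.
config = {
    "gcmd_instrument": {"version": "9.1.5"},
    "cf_standard_name": {"version": "20210119T185509"},
    "mmd_platform_type": {},  # not pinned: the latest version would be fetched
}
pinned = collect_pinned_versions(config)
# `pinned` could then be passed on, e.g. pti.update_all_vocabularies(versions=pinned)
```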

Name wkv_variable in pythesintrc.yaml

wkv = well-known-variable, i.e., it should be enough to call it wkv but perhaps we should write it out to make it easier to understand for a user?
What about nersc_wkv or nersc_well_known_variable?

Migrate CI to Github Actions

Travis CI has changed its pricing policy; it is no longer free for open-source repositories (not without negotiating a quota).
We are migrating our CI to Github Actions.

Should we add other registries as well?

For example, enhetsregisteret has basic data about Norwegian organisations. It can be useful as a supplement to the gcmd_provider list. But since we actually store this data locally, pythesint can become quite big in terms of required storage. Do we want this? Or could we, e.g., only download data if indicated by a boolean flag (for example vocabularies that the user will use repeatedly)?

GCMD rucontenttype's structure evolved

In GCMD version 11.0, the rucontenttype GCMD vocabulary has an extra column: "URLContentType", which needs to be added in the categories part of the vocabulary in the pythesintrc.yaml file.

⚠️ If the new category is added, it won't be possible anymore to use the previous versions of the vocabulary.

Add linkage/mapping between vocabularies

The thesaurus name can be interpreted as meaning a translation between different standards, i.e., that in the end we will connect, e.g., the cf standard names to the gcmd science keywords.

From Cambridge Dictionaries Online: thesaurus is a type of dictionary in which words with similar meanings are arranged in groups

Consequently, we should have this linking/mapping between vocabularies in this repository. However, it may already exist - see: https://marinemetadata.org/community/teams/ont/vocharmony/vocmapping

python setup.py test does not work

File "/home/vagrant/miniconda/lib/python2.7/unittest/loader.py", line 100, in loadTestsFromName
    parent, obj = obj, getattr(obj, part)
AttributeError: 'module' object has no attribute 'test_pythesint'

This is done in the develop vm. It is strange that the tests are passing on travis, though...
