nansencenter / py-thesaurus-interface
An interface to metadata conventions for geospatial data
License: GNU General Public License v3.0
For some reason, old versions of PyYAML, requests and xdg were pinned in setup.py.
This may cause problems with Python 3.10 in the future.
Something is not working on Python 3.5: one test fails on all platforms (it passes on both Python 2.7 and 3.6).
.............................F...
======================================================================
FAIL: test_find_keyword (pythesint.tests.test_vocabulary.VocabularyTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/vagrant/shared/pythesint_vm/py-thesaurus-interface/pythesint/tests/test_vocabulary.py", line 52, in test_find_keyword
self.assertEqual(vocab.find_keyword('Animal'), self.animal)
AssertionError: OrderedDict([('Type', 'Cat'), ('Category', 'Animal'), ('Name', '')]) != OrderedDict([('Type', ''), ('Category', 'Animal'), ('Name', '')])
----------------------------------------------------------------------
Ran 33 tests in 44.271s
Every time vocabularies are updated locally, the latest version from GCMD keywords is pulled.
This might be undesirable, especially when there are updates to existing keywords (it was the case recently for Sentinel-1: "SENTINEL-1A" became "Sentinel-1A" and the same for 1B).
This requires updating existing django-geo-spaas databases, for example.
It would be worthwhile to explore a bit the GCMD API and see if the version of keywords can be pinned in pythesintrc.yaml.
That way, fixed keyword versions would be associated with a pythesint release, increasing the reliability of the package.
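If the GCMD API turns out to support it, the pinning could take a shape like this in pythesintrc.yaml (purely hypothetical sketch; neither the structure nor the version key exists yet):

```yaml
# Hypothetical sketch only: pin a GCMD vocabulary to a fixed keyword version,
# pending exploration of what the GCMD API actually allows.
gcmd_platform:
  version: "11.0"
```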
In GCMD version 11.0, the rucontenttype GCMD vocabulary has an extra column, "URLContentType", which needs to be added in the categories part of the vocabulary in the pythesintrc.yaml file.
For now we only build and publish a source package; it would be nice to build a wheel as well.
Some updates to the source repository for this vocabulary have broken it
(probably this commit: metno/mmd@026c77e).
test_get_mmd_platform_type()
fails because "maps/charts/photographs" has been removed.

For example, enhetsregisteret has basic data about Norwegian organisations. It could be useful as a supplement to the gcmd_provider list. But since we actually store this data locally, pythesint can become quite big in terms of required storage. Do we want this? Or could we, e.g., only download data if indicated by a boolean flag (for example for vocabularies that the user will use repeatedly)?
File "/home/vagrant/miniconda/lib/python2.7/unittest/loader.py", line 100, in loadTestsFromName
parent, obj = obj, getattr(obj, part)
AttributeError: 'module' object has no attribute 'test_pythesint'
This is done in the develop VM. It is strange that the tests are passing on Travis, though...
I have got a problem with getting project keywords from pythesint with Python 3.6 (nansat VM). It works fine with Python 2, as well as with any other keywords (platforms, instruments) with Python 3.
>>> import pythesint as pti
>>> pti.get_gcmd_project('GHRSST')
Traceback (most recent call last):
File "/home/vagrant/anaconda/envs/py3nansat/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2961, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-8-29c8ed8ce8a4>", line 1, in <module>
pti.get_gcmd_project('GHRSST')
File "/home/vagrant/anaconda/envs/py3nansat/lib/python3.6/site-packages/pythesint/vocabulary.py", line 45, in find_keyword
for d in self.get_list():
File "/home/vagrant/anaconda/envs/py3nansat/lib/python3.6/site-packages/pythesint/json_vocabulary.py", line 17, in get_list
return self.sort_list(json.load(open(self.get_filepath())))
File "/home/vagrant/anaconda/envs/py3nansat/lib/python3.6/json/__init__.py", line 299, in load
parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
File "/home/vagrant/anaconda/envs/py3nansat/lib/python3.6/json/__init__.py", line 354, in loads
return _default_decoder.decode(s)
File "/home/vagrant/anaconda/envs/py3nansat/lib/python3.6/json/decoder.py", line 339, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/home/vagrant/anaconda/envs/py3nansat/lib/python3.6/json/decoder.py", line 357, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
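The same error can be reproduced with an empty JSON file, which suggests the local projects file was written empty or truncated; deleting it and re-running the vocabulary update (e.g. pti.update_all_vocabularies()) may help:

```python
import json
import os
import tempfile

# Reproduce the failure mode: json.load() on a zero-byte file raises exactly
# "Expecting value: line 1 column 1 (char 0)", matching the traceback above.
fd, path = tempfile.mkstemp(suffix='.json')  # creates an empty file
os.close(fd)
try:
    with open(path) as fp:
        json.load(fp)
    message = None
except json.JSONDecodeError as err:
    message = str(err)
finally:
    os.remove(path)
```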
Travis CI has changed its pricing policy; it is no longer free for open-source repositories (not without negotiating a quota).
We are migrating our CI to Github Actions.
If a URL containing a vocabulary file is unavailable, attempts are still made to process the data.
An HTTP code other than 2xx should directly raise an exception.
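A minimal sketch of the intended behaviour, using a stand-in for requests.Response (whose real raise_for_status() raises requests.HTTPError for non-2xx codes); the function name is illustrative, not pythesint's actual API:

```python
class Response:
    """Minimal stand-in for requests.Response, for illustration only."""
    def __init__(self, status_code, text=''):
        self.status_code = status_code
        self.text = text

    def raise_for_status(self):
        # Mirror requests' behaviour: raise on any non-2xx status code.
        if not 200 <= self.status_code < 300:
            raise RuntimeError(f'HTTP error {self.status_code}')

def fetch_vocabulary(get, url):
    # Fail fast: a 404 or 500 raises here instead of letting an error page
    # flow into the vocabulary parser.
    response = get(url)
    response.raise_for_status()
    return response.text
```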
The source for MMD vocabularies changed from several files (one per category of data) to one file containing everything.
The URLs for the GCMD vocabularies contained in the pythesintrc.yaml file are no longer valid.
The static directory seems to have been abandoned; the RESTful service should be used instead.
For example:
https://gcmdservices.gsfc.nasa.gov/static/kms/instruments/instruments.csv
is replaced by:
https://gcmdservices.gsfc.nasa.gov/kms/concepts/concept_scheme/instruments/?format=csv
In the file README.md, on line 43, the link http://gcmd.gsfc.nasa.gov appears to be dead.
The Web Archive shows the last snapshot, from 29 March 2022, redirecting to https://web.archive.org/web/20220329102926/https://idn.ceos.org/, and the site https://idn.ceos.org/ does appear to be functional.
A DuckDuckGo search for the link returns top four:
The names of the fields for GCMD platforms have been reverted to their previous values.
The aliases need to be removed from the pythesintrc.yaml file.
Well, at least add an option to have it in a "user-controlled" location.
E.g., on Unix/Linux, have a .pythesintrc.yaml file in the home directory that is checked before loading the default config from resource_string(__name__, 'pythesintrc.yaml').
Adding long_description_content_type="text/markdown" to the setup.py file would probably solve this.
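The relevant setup() arguments would look something like this (fragment only; all other arguments unchanged):

```python
from setuptools import setup

setup(
    # ... other arguments unchanged ...
    long_description=open('README.md').read(),
    long_description_content_type='text/markdown',  # tell PyPI it is Markdown
)
```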
Probably after the recent change to download location.
Will look into it tomorrow.
Build and upload the PyPI package automatically upon releases.
This is also connected to issue #3
One solution is to not check the certificate, with requests.get(url, verify=False).
But is it safe?
The gcmdservices.gsfc.nasa.gov server is not accessible (retired) and has to be replaced with https://gcmd.earthdata.nasa.gov.
https://forum.earthdata.nasa.gov/viewtopic.php?p=10734
MMD is available here: https://github.com/metno/mmd
The controlled vocabularies are here: https://github.com/metno/mmd/tree/master/thesauri
Todo:
- Add the MMD controlled vocabularies to pythesint
- Write tests
Regarding this code:
# OBS: This works for the gcmd keywords but makes no sense for the cf
# standard names - therefore always search the cf standard names by
# standard_name only..
for m in matches:
    remaining = {}
    for i in ii:
        remaining[keys[i]] = m[keys[i]]
    if not any(val for val in remaining.values()):  # .itervalues() is Python 2 only
        return m
cf_vocabulary should rather override the find_keyword method, or the find_keyword method should be made more generic and then overridden.
I vote for generic and overridden; consider the following scenario:
What if there are multiple matches, e.g. I search for "Imaging Radars"? Then it should return all matches (currently it returns nothing, which I think is a bug).
Now, if it returned a list of all matches, then cf_vocabulary could override this by calling find_keyword from the superclass and then performing the search on standard_name only.
BUT - why are we only searching standard names? I think there should be an option here. Perhaps I don't know which standard name to use, and want to search in the description as well?
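A sketch of that design: a generic find_keyword() returning all matches, with the CF vocabulary overriding it to search standard_name only by default (class and method names are illustrative, not the current pythesint code):

```python
class Vocabulary:
    """Generic vocabulary: find_keyword() returns *all* matching entries."""
    def __init__(self, keywords):
        self.keywords = keywords  # list of dicts

    def find_keyword(self, value, fields=None):
        # Search the given fields, or every field when none are given,
        # and collect every matching entry instead of at most one.
        results = []
        for entry in self.keywords:
            searched = fields if fields else entry.keys()
            if any(entry.get(field) == value for field in searched):
                results.append(entry)
        return results

class CFVocabulary(Vocabulary):
    def find_keyword(self, value, fields=None):
        # CF standard names only make sense via 'standard_name', so the
        # subclass narrows the default search to that single field.
        return super().find_keyword(value, fields or ['standard_name'])
```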
It was https://gcmdservices.gsfc.nasa.gov
Now it is https://gcmd.nasa.gov
Fix: update pythesintrc.yaml
If a GCMD object has multiple fields which match the keyword, Vocabulary.search() will append this object to the list of results as many times as it has matching fields.
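A minimal sketch of a deduplicated search, assuming entries are dicts; the function mirrors the idea of Vocabulary.search() but is illustrative:

```python
def search(entries, keyword):
    """Return entries containing the keyword, each appended at most once,
    even when several of an entry's fields match."""
    keyword = keyword.lower()
    results = []
    for entry in entries:
        # any() stops at the first matching field, so the entry is
        # appended once per entry, not once per matching field.
        if any(keyword in str(value).lower() for value in entry.values()):
            results.append(entry)
    return results
```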
In the JSONVocabulary.get_list() method, a file is opened and never closed.
When I do the command
pip install https://github.com/nansencenter/py-thesaurus-interface/archive/master.tar.gz
I get the error:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-MVU4Zj-build/setup.py", line 84, in <module>
cmdclass = {'install_scripts': update_vocabularies}
File "/home/timill/Packages/python/miniconda2-conda-forge/lib/python2.7/site-packages/setuptools/__init__.py", line 129, in setup
return distutils.core.setup(**attrs)
File "/home/timill/Packages/python/miniconda2-conda-forge/lib/python2.7/distutils/core.py", line 151, in setup
dist.run_commands()
File "/home/timill/Packages/python/miniconda2-conda-forge/lib/python2.7/distutils/dist.py", line 953, in run_commands
self.run_command(cmd)
File "/home/timill/Packages/python/miniconda2-conda-forge/lib/python2.7/distutils/dist.py", line 972, in run_command
cmd_obj.run()
File "/home/timill/Packages/python/miniconda2-conda-forge/lib/python2.7/site-packages/setuptools/command/install.py", line 61, in run
return orig.install.run(self)
File "/home/timill/Packages/python/miniconda2-conda-forge/lib/python2.7/distutils/command/install.py", line 575, in run
self.run_command(cmd_name)
File "/home/timill/Packages/python/miniconda2-conda-forge/lib/python2.7/distutils/cmd.py", line 326, in run_command
self.distribution.run_command(command)
File "/home/timill/Packages/python/miniconda2-conda-forge/lib/python2.7/distutils/dist.py", line 972, in run_command
cmd_obj.run()
File "/tmp/pip-MVU4Zj-build/setup.py", line 31, in run
import pythesint as pti
File "pythesint/__init__.py", line 2, in <module>
from pythesint.pythesint import update_all_vocabularies
File "pythesint/pythesint.py", line 39, in <module>
_process_config()
File "pythesint/pythesint.py", line 24, in _process_config
fromlist=[name])
File "pythesint/wkv_vocabulary.py", line 6, in <module>
from pythesint.json_vocabulary import JSONVocabulary
File "pythesint/json_vocabulary.py", line 7, in <module>
from pythesint.pathsolver import DATA_HOME
File "pythesint/pathsolver.py", line 5, in <module>
from xdg import (XDG_DATA_HOME)
File "/home/timill/Packages/python/miniconda2-conda-forge/lib/python2.7/site-packages/xdg.py", line 40
def _getenv(variable: str, default: str) -> str:
^
SyntaxError: invalid syntax
The fields returned by the GCMD webservice for platforms have changed for all versions of the vocabulary.
It went from: Category, Series_Entity, Short_Name, Long_Name
to: Basis, Category, Sub_Category, Short_Name, Long_Name
As a result, the get_* and search_* functions no longer work for platforms, no matter which vocabulary version is specified.
We need to check all other vocabularies and adapt the code to work with the new fields, and decide whether the interface for platforms should change. Do we keep returning objects with the old fields, ensuring compatibility with existing software, or do we use the new fields, in which case all software using pythesint needs to be updated?
The badge that shows the build status is using Travis. It should be removed or replaced with the equivalent from GitHub Actions.
Aleksander to create example with mock objects...
We put it on pypi, so we should make proper docs...
Add a configuration folder where the available lists are specified, e.g. ~/.pythesintrc - and also put the JSON files there.
Vagrant VMs cannot install Python 3.7 because there is no conda package of pythesint for that Python version.
To do:
The path to store the JSON files should not be fixed (as it is now) but provided as an optional parameter to the JSONThesaurus methods.
We're using version 30, whereas the latest one is number 76 (https://cfconventions.org/Data/cf-standard-names/current/build/cf-standard-name-table.html).
Change to using the Marine Metadata Interoperability Ontology Registry and Repository (https://mmisw.org/ont/cf/parameter) - that should always have the latest version.
We should note the CF version number and origin (as a URI) in the local copy - see how it is done in gcmd_vocabulary.py.
Since builds are run for two versions of Python, when a release is made Travis attempts to upload it twice, which means that one of the builds will always fail.
This can be solved by adding the skip_existing: true option to the deploy section:
https://docs.travis-ci.com/user/deployment/pypi/#upload-artifacts-only-once
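Per the linked Travis documentation, the deploy section would look something like this (fragment only; the other deploy keys are omitted):

```yaml
deploy:
  provider: pypi
  skip_existing: true  # do not fail if the artifact is already on PyPI
```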
The thesaurus name can be interpreted as meaning a translation between different standards, i.e., that in the end we will connect, e.g., the cf standard names to the gcmd science keywords.
From Cambridge Dictionaries Online: a thesaurus is "a type of dictionary in which words with similar meanings are arranged in groups".
Consequently, we should have this linking/mapping between vocabularies in this repository. However, it may already exist - see: https://marinemetadata.org/community/teams/ont/vocharmony/vocmapping
The latest release is from 29 June 2020.
wkv = well-known variable, i.e., it should be enough to call it wkv,
but perhaps we should write it out to make it easier for a user to understand?
What about nersc_wkv or nersc_well_known_variable?
When I do the command:
pip install https://github.com/nansencenter/py-thesaurus-interface/archive/master.tar.gz
I get the error:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-t97wYu-build/setup.py", line 84, in <module>
cmdclass = {'install_scripts': update_vocabularies}
File "/usr/lib/python2.7/distutils/core.py", line 111, in setup
_setup_distribution = dist = klass(attrs)
File "/usr/lib/python2.7/dist-packages/setuptools/dist.py", line 269, in __init__
self.fetch_build_eggs(attrs['setup_requires'])
File "/usr/lib/python2.7/dist-packages/setuptools/dist.py", line 313, in fetch_build_eggs
replace_conflicting=True,
File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 826, in resolve
dist = best[req.key] = env.best_match(req, ws, installer)
File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 1092, in best_match
return self.obtain(req, installer)
File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 1104, in obtain
return installer(requirement)
File "/usr/lib/python2.7/dist-packages/setuptools/dist.py", line 380, in fetch_build_egg
return cmd.easy_install(req)
File "/usr/lib/python2.7/dist-packages/setuptools/command/easy_install.py", line 663, in easy_install
return self.install_item(spec, dist.location, tmpdir, deps)
File "/usr/lib/python2.7/dist-packages/setuptools/command/easy_install.py", line 693, in install_item
dists = self.install_eggs(spec, download, tmpdir)
File "/usr/lib/python2.7/dist-packages/setuptools/command/easy_install.py", line 873, in install_eggs
return self.build_and_install(setup_script, setup_base)
File "/usr/lib/python2.7/dist-packages/setuptools/command/easy_install.py", line 1101, in build_and_install
self.run_setup(setup_script, setup_base, args)
File "/usr/lib/python2.7/dist-packages/setuptools/command/easy_install.py", line 1087, in run_setup
run_setup(setup_script, args)
File "/usr/lib/python2.7/dist-packages/setuptools/sandbox.py", line 246, in run_setup
raise
File "/usr/lib/python2.7/contextlib.py", line 35, in __exit__
self.gen.throw(type, value, traceback)
File "/usr/lib/python2.7/dist-packages/setuptools/sandbox.py", line 195, in setup_context
yield
File "/usr/lib/python2.7/contextlib.py", line 35, in __exit__
self.gen.throw(type, value, traceback)
File "/usr/lib/python2.7/dist-packages/setuptools/sandbox.py", line 166, in save_modules
saved_exc.resume()
File "/usr/lib/python2.7/dist-packages/setuptools/sandbox.py", line 141, in resume
six.reraise(type, exc, self._tb)
File "/usr/lib/python2.7/dist-packages/setuptools/sandbox.py", line 154, in save_modules
yield saved
File "/usr/lib/python2.7/dist-packages/setuptools/sandbox.py", line 195, in setup_context
yield
File "/usr/lib/python2.7/dist-packages/setuptools/sandbox.py", line 243, in run_setup
DirectorySandbox(setup_dir).run(runner)
File "/usr/lib/python2.7/dist-packages/setuptools/sandbox.py", line 273, in run
return func()
File "/usr/lib/python2.7/dist-packages/setuptools/sandbox.py", line 242, in runner
_execfile(setup_script, ns)
File "/usr/lib/python2.7/dist-packages/setuptools/sandbox.py", line 46, in _execfile
exec(code, globals, locals)
File "/tmp/easy_install-12gS2d/xdg-2.0.0/setup.py", line 19, in <module>
REQS = ['PyYAML', 'requests', 'xdg;platform_system!="Windows"']
File "/tmp/easy_install-12gS2d/xdg-2.0.0/setup.py", line 12, in read_long_description
from setuptools import setup, find_packages
AttributeError: 'PosixPath' object has no attribute 'read_text'
The revision information stored in the GCMD vocabulary JSON files causes an empty dict to be present in the output of JSONVocabulary.get_list().
This causes the update of the database vocabularies in django-geo-spaas to fail.
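A minimal workaround sketch, filtering out the empty dict before consumers like django-geo-spaas see the list (the function name is illustrative):

```python
def drop_empty(entries):
    """Remove empty dicts (such as the revision header) from a keyword
    list, leaving the actual vocabulary entries untouched."""
    return [entry for entry in entries if entry]  # {} is falsy, so it is dropped
```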
There seems to be a Python 3 issue with byte strings in the response data.
Right now, update_all_vocabularies() is run every time the package is installed.
If something goes wrong during the execution, the installation of the package fails.
This kind of failure is often caused by a change in the source of one of the vocabularies, which can usually be solved quite easily by changing the version of the source that is used. But if the installation fails, it is impossible to even try that.
There would be several solutions to fix that, for example running update_all_vocabularies() on import instead of on install (only if there are no local files present, to avoid calling it every time).
What do you think @akorosov?