nansencenter / py-thesaurus-interface
An interface to metadata conventions for geospatial data
License: GNU General Public License v3.0
For some reason, old versions of PyYAML, requests and xdg were pinned in setup.py.
This may cause problems with Python 3.10 in the future.
Something is not working on Python 3.5: one test fails on all platforms (it passes on both Python 2.7 and 3.6).
.............................F...
======================================================================
FAIL: test_find_keyword (pythesint.tests.test_vocabulary.VocabularyTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/vagrant/shared/pythesint_vm/py-thesaurus-interface/pythesint/tests/test_vocabulary.py", line 52, in test_find_keyword
self.assertEqual(vocab.find_keyword('Animal'), self.animal)
AssertionError: OrderedDict([('Type', 'Cat'), ('Category', 'Animal'), ('Name', '')]) != OrderedDict([('Type', ''), ('Category', 'Animal'), ('Name', '')])
----------------------------------------------------------------------
Ran 33 tests in 44.271s
Every time vocabularies are updated locally, the latest version from GCMD keywords is pulled.
This might be undesirable, especially when there are updates to existing keywords (it was the case recently for Sentinel-1: "SENTINEL-1A" became "Sentinel-1A" and the same for 1B).
This requires updating existing django-geo-spaas databases, for example.
It would be worthwhile to explore a bit the GCMD API and see if the version of keywords can be pinned in pythesintrc.yaml.
That way, fixed keyword versions would be associated with a pythesint release, increasing the reliability of the package.
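If the GCMD API turns out to support it, the pinning could take a shape like this in pythesintrc.yaml (purely hypothetical sketch; neither the structure nor the version key exists yet):

```yaml
# Hypothetical sketch only: pin a GCMD vocabulary to a fixed keyword version,
# pending exploration of what the GCMD API actually allows.
gcmd_platform:
  version: "11.0"
```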
In GCMD version 11.0, the rucontenttype GCMD vocabulary has an extra column, "URLContentType", which needs to be added in the categories part of the vocabulary in the pythesintrc.yaml file.
For now we only build and publish a source package; it would be nice to build a wheel as well.
Some updates to the source repository for this vocabulary have broken it
(probably this commit: metno/mmd@026c77e).
test_get_mmd_platform_type()
fails because "maps/charts/photographs" has been removed.

For example, enhetsregisteret has basic data about Norwegian organisations. It could be useful as a supplement to the gcmd_provider list. But since we actually store this data locally, pythesint can become quite big in terms of required storage. Do we want this? Or could we, e.g., only download data if indicated by a boolean flag (for example for vocabularies that the user will use repeatedly)?
File "/home/vagrant/miniconda/lib/python2.7/unittest/loader.py", line 100, in loadTestsFromName
parent, obj = obj, getattr(obj, part)
AttributeError: 'module' object has no attribute 'test_pythesint'
This is done in the develop VM. It is strange that the tests are passing on Travis, though...
I have got a problem with getting project keywords from pythesint with Python 3.6 (nansat VM). It works fine with Python 2, as well as with any other keywords (platforms, instruments) with Python 3.
>>> import pythesint as pti
>>> pti.get_gcmd_project('GHRSST')
Traceback (most recent call last):
File "/home/vagrant/anaconda/envs/py3nansat/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2961, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-8-29c8ed8ce8a4>", line 1, in <module>
pti.get_gcmd_project('GHRSST')
File "/home/vagrant/anaconda/envs/py3nansat/lib/python3.6/site-packages/pythesint/vocabulary.py", line 45, in find_keyword
for d in self.get_list():
File "/home/vagrant/anaconda/envs/py3nansat/lib/python3.6/site-packages/pythesint/json_vocabulary.py", line 17, in get_list
return self.sort_list(json.load(open(self.get_filepath())))
File "/home/vagrant/anaconda/envs/py3nansat/lib/python3.6/json/__init__.py", line 299, in load
parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
File "/home/vagrant/anaconda/envs/py3nansat/lib/python3.6/json/__init__.py", line 354, in loads
return _default_decoder.decode(s)
File "/home/vagrant/anaconda/envs/py3nansat/lib/python3.6/json/decoder.py", line 339, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/home/vagrant/anaconda/envs/py3nansat/lib/python3.6/json/decoder.py", line 357, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
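The same error can be reproduced with an empty JSON file, which suggests the local projects file was written empty or truncated; deleting it and re-running the vocabulary update (e.g. pti.update_all_vocabularies()) may help:

```python
import json
import os
import tempfile

# Reproduce the failure mode: json.load() on a zero-byte file raises exactly
# "Expecting value: line 1 column 1 (char 0)", matching the traceback above.
fd, path = tempfile.mkstemp(suffix='.json')  # creates an empty file
os.close(fd)
try:
    with open(path) as fp:
        json.load(fp)
    message = None
except json.JSONDecodeError as err:
    message = str(err)
finally:
    os.remove(path)
```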
Travis CI has changed its pricing policy; it is no longer free for open-source repositories (not without negotiating a quota).
We are migrating our CI to Github Actions.
If a URL containing a vocabulary file is unavailable, attempts are still made to process the data.
An HTTP code other than 2xx should directly raise an exception.
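A minimal sketch of the intended behaviour, using a stand-in for requests.Response (whose real raise_for_status() raises requests.HTTPError for non-2xx codes); the function name is illustrative, not pythesint's actual API:

```python
class Response:
    """Minimal stand-in for requests.Response, for illustration only."""
    def __init__(self, status_code, text=''):
        self.status_code = status_code
        self.text = text

    def raise_for_status(self):
        # Mirror requests' behaviour: raise on any non-2xx status code.
        if not 200 <= self.status_code < 300:
            raise RuntimeError(f'HTTP error {self.status_code}')

def fetch_vocabulary(get, url):
    # Fail fast: a 404 or 500 raises here instead of letting an error page
    # flow into the vocabulary parser.
    response = get(url)
    response.raise_for_status()
    return response.text
```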
The source for MMD vocabularies changed from several files (one per category of data) to one file containing everything.
The URLs for the GCMD vocabularies contained in the pythesintrc.yaml file are no longer valid.
The static directory seems to have been abandoned; the RESTful service should be used instead.
For example:
https://gcmdservices.gsfc.nasa.gov/static/kms/instruments/instruments.csv
is replaced by:
https://gcmdservices.gsfc.nasa.gov/kms/concepts/concept_scheme/instruments/?format=csv
In the file README.md, on line 43, the link http://gcmd.gsfc.nasa.gov appears to be dead.
The Web Archive shows the last snapshot, from 29 March 2022, redirecting to https://web.archive.org/web/20220329102926/https://idn.ceos.org/, and the site https://idn.ceos.org/ does appear to be functional.
A DuckDuckGo search for the link returns top four:
The names of the fields for GCMD platforms have been reverted to their previous values.
The aliases need to be removed from the pythesintrc.yaml file.
Well, at least add an option to have it in a "user-controlled" location.
E.g., on Unix/Linux, have a .pythesintrc.yaml file in the home directory that is checked before loading the default config from resource_string(__name__, 'pythesintrc.yaml').
Adding long_description_content_type="text/markdown" to the setup.py file would probably solve this.
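The relevant setup() arguments would look something like this (fragment only; all other arguments unchanged):

```python
from setuptools import setup

setup(
    # ... other arguments unchanged ...
    long_description=open('README.md').read(),
    long_description_content_type='text/markdown',  # tell PyPI it is Markdown
)
```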
Probably after the recent change to download location.
Will look into it tomorrow.
Build and upload the PyPI package automatically upon releases.
This is also connected to issue #3
One solution is to not check the certificate, with requests.get(url, verify=False).
But is it safe?
The gcmdservices.gsfc.nasa.gov server is not accessible (retired) and has to be replaced with https://gcmd.earthdata.nasa.gov.
https://forum.earthdata.nasa.gov/viewtopic.php?p=10734
MMD is available here: https://github.com/metno/mmd
The controlled vocabularies are here: https://github.com/metno/mmd/tree/master/thesauri
Todo:
- Add the MMD controlled vocabularies to pythesint
- Write tests
Regarding this code:
# OBS: This works for the gcmd keywords but makes no sense for the cf
# standard names - therefore always search the cf standard names by
# standard_name only..
for m in matches:
    remaining = {}
    for i in ii:
        remaining[keys[i]] = m[keys[i]]
    if not any(val for val in remaining.values()):  # .itervalues() is Python 2 only
        return m
cf_vocabulary should rather override the find_keyword method, or the find_keyword method should be made more generic and then overridden.
I vote for generic and overridden; consider the following scenario:
What if there are multiple matches, e.g. I search for "Imaging Radars"? Then it should return all matches (currently it returns nothing, which I think is a bug).
Now, if it returned a list of all matches, then cf_vocabulary could override this by calling find_keyword from the superclass and then performing the search on standard_name only.
BUT - why are we only searching standard names? I think there should be an option here. Perhaps I don't know which standard name to use, and want to search in the description as well?
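A sketch of that design: a generic find_keyword() returning all matches, with the CF vocabulary overriding it to search standard_name only by default (class and method names are illustrative, not the current pythesint code):

```python
class Vocabulary:
    """Generic vocabulary: find_keyword() returns *all* matching entries."""
    def __init__(self, keywords):
        self.keywords = keywords  # list of dicts

    def find_keyword(self, value, fields=None):
        # Search the given fields, or every field when none are given,
        # and collect every matching entry instead of at most one.
        results = []
        for entry in self.keywords:
            searched = fields if fields else entry.keys()
            if any(entry.get(field) == value for field in searched):
                results.append(entry)
        return results

class CFVocabulary(Vocabulary):
    def find_keyword(self, value, fields=None):
        # CF standard names only make sense via 'standard_name', so the
        # subclass narrows the default search to that single field.
        return super().find_keyword(value, fields or ['standard_name'])
```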
It was https://gcmdservices.gsfc.nasa.gov
Now it is https://gcmd.nasa.gov
Fix: update pythesintrc.yaml
If a GCMD object has multiple fields which match the keyword, Vocabulary.search() will append this object to the list of results as many times as it has matching fields.
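A minimal sketch of a deduplicated search, assuming entries are dicts; the function mirrors the idea of Vocabulary.search() but is illustrative:

```python
def search(entries, keyword):
    """Return entries containing the keyword, each appended at most once,
    even when several of an entry's fields match."""
    keyword = keyword.lower()
    results = []
    for entry in entries:
        # any() stops at the first matching field, so the entry is
        # appended once per entry, not once per matching field.
        if any(keyword in str(value).lower() for value in entry.values()):
            results.append(entry)
    return results
```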
In the JSONVocabulary.get_list() method, a file is opened and never closed.
When I do the command
pip install https://github.com/nansencenter/py-thesaurus-interface/archive/master.tar.gz
I get the error:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-MVU4Zj-build/setup.py", line 84, in <module>
cmdclass = {'install_scripts': update_vocabularies}
File "/home/timill/Packages/python/miniconda2-conda-forge/lib/python2.7/site-packages/setuptools/__init__.py", line 129, in setup
return distutils.core.setup(**attrs)
File "/home/timill/Packages/python/miniconda2-conda-forge/lib/python2.7/distutils/core.py", line 151, in setup
dist.run_commands()
File "/home/timill/Packages/python/miniconda2-conda-forge/lib/python2.7/distutils/dist.py", line 953, in run_commands
self.run_command(cmd)
File "/home/timill/Packages/python/miniconda2-conda-forge/lib/python2.7/distutils/dist.py", line 972, in run_command
cmd_obj.run()
File "/home/timill/Packages/python/miniconda2-conda-forge/lib/python2.7/site-packages/setuptools/command/install.py", line 61, in run
return orig.install.run(self)
File "/home/timill/Packages/python/miniconda2-conda-forge/lib/python2.7/distutils/command/install.py", line 575, in run
self.run_command(cmd_name)
File "/home/timill/Packages/python/miniconda2-conda-forge/lib/python2.7/distutils/cmd.py", line 326, in run_command
self.distribution.run_command(command)
File "/home/timill/Packages/python/miniconda2-conda-forge/lib/python2.7/distutils/dist.py", line 972, in run_command
cmd_obj.run()
File "/tmp/pip-MVU4Zj-build/setup.py", line 31, in run
import pythesint as pti
File "pythesint/__init__.py", line 2, in <module>
from pythesint.pythesint import update_all_vocabularies
File "pythesint/pythesint.py", line 39, in <module>
_process_config()
File "pythesint/pythesint.py", line 24, in _process_config
fromlist=[name])
File "pythesint/wkv_vocabulary.py", line 6, in <module>
from pythesint.json_vocabulary import JSONVocabulary
File "pythesint/json_vocabulary.py", line 7, in <module>
from pythesint.pathsolver import DATA_HOME
File "pythesint/pathsolver.py", line 5, in <module>
from xdg import (XDG_DATA_HOME)
File "/home/timill/Packages/python/miniconda2-conda-forge/lib/python2.7/site-packages/xdg.py", line 40
def _getenv(variable: str, default: str) -> str:
^
SyntaxError: invalid syntax
The fields returned by the GCMD webservice for platforms have changed for all versions of the vocabulary.
It went from: Category, Series_Entity, Short_Name, Long_Name
to: Basis, Category, Sub_Category, Short_Name, Long_Name
As a result, the get_* and search_* functions no longer work for platforms, no matter which vocabulary version is specified.
We need to check all other vocabularies and adapt the code to work with the new fields, and decide whether the interface for platforms should change. Do we keep returning objects with the old fields, ensuring compatibility with existing software, or do we use the new fields, in which case all software using pythesint needs to be updated?
The badge that shows the build status is using Travis. It should be removed or replaced with the equivalent from GitHub Actions.
Aleksander to create example with mock objects...
We put it on pypi, so we should make proper docs...
Add a configuration folder where the available lists are specified, e.g. ~/.pythesintrc - and also put the JSON files there.
Vagrant VMs cannot install Python 3.7 because there is no conda package of pythesint for that Python version.
To do:
The path to store the JSON files should not be fixed (as it is now) but provided as an optional parameter to the JSONThesaurus methods.
We're using version 30, whereas the latest one is number 76 (https://cfconventions.org/Data/cf-standard-names/current/build/cf-standard-name-table.html).
Change to using the Marine Metadata Interoperability Ontology Registry and Repository (https://mmisw.org/ont/cf/parameter) - that should always have the latest version.
We should note the CF version number and origin (as a URI) in the local copy - see how it is done in gcmd_vocabulary.py.
Since builds are run for two versions of Python, when a release is made Travis attempts to upload it twice, which means that one of the builds will always fail.
This can be solved by adding the skip_existing: true option to the deploy section:
https://docs.travis-ci.com/user/deployment/pypi/#upload-artifacts-only-once
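Per the linked Travis documentation, the deploy section would look something like this (fragment only; the other deploy keys are omitted):

```yaml
deploy:
  provider: pypi
  skip_existing: true  # do not fail if the artifact is already on PyPI
```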
The thesaurus name can be interpreted as meaning a translation between different standards, i.e., that in the end we will connect, e.g., the cf standard names to the gcmd science keywords.
From Cambridge Dictionaries Online: a thesaurus is "a type of dictionary in which words with similar meanings are arranged in groups".
Consequently, we should have this linking/mapping between vocabularies in this repository. However, it may already exist - see: https://marinemetadata.org/community/teams/ont/vocharmony/vocmapping
The latest release is from 29 June 2020.
wkv = well-known variable, i.e., it should be enough to call it wkv,
but perhaps we should write it out to make it easier for a user to understand?
What about nersc_wkv or nersc_well_known_variable?
When I do the command:
pip install https://github.com/nansencenter/py-thesaurus-interface/archive/master.tar.gz
I get the error:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-t97wYu-build/setup.py", line 84, in <module>
cmdclass = {'install_scripts': update_vocabularies}
File "/usr/lib/python2.7/distutils/core.py", line 111, in setup
_setup_distribution = dist = klass(attrs)
File "/usr/lib/python2.7/dist-packages/setuptools/dist.py", line 269, in __init__
self.fetch_build_eggs(attrs['setup_requires'])
File "/usr/lib/python2.7/dist-packages/setuptools/dist.py", line 313, in fetch_build_eggs
replace_conflicting=True,
File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 826, in resolve
dist = best[req.key] = env.best_match(req, ws, installer)
File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 1092, in best_match
return self.obtain(req, installer)
File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 1104, in obtain
return installer(requirement)
File "/usr/lib/python2.7/dist-packages/setuptools/dist.py", line 380, in fetch_build_egg
return cmd.easy_install(req)
File "/usr/lib/python2.7/dist-packages/setuptools/command/easy_install.py", line 663, in easy_install
return self.install_item(spec, dist.location, tmpdir, deps)
File "/usr/lib/python2.7/dist-packages/setuptools/command/easy_install.py", line 693, in install_item
dists = self.install_eggs(spec, download, tmpdir)
File "/usr/lib/python2.7/dist-packages/setuptools/command/easy_install.py", line 873, in install_eggs
return self.build_and_install(setup_script, setup_base)
File "/usr/lib/python2.7/dist-packages/setuptools/command/easy_install.py", line 1101, in build_and_install
self.run_setup(setup_script, setup_base, args)
File "/usr/lib/python2.7/dist-packages/setuptools/command/easy_install.py", line 1087, in run_setup
run_setup(setup_script, args)
File "/usr/lib/python2.7/dist-packages/setuptools/sandbox.py", line 246, in run_setup
raise
File "/usr/lib/python2.7/contextlib.py", line 35, in __exit__
self.gen.throw(type, value, traceback)
File "/usr/lib/python2.7/dist-packages/setuptools/sandbox.py", line 195, in setup_context
yield
File "/usr/lib/python2.7/contextlib.py", line 35, in __exit__
self.gen.throw(type, value, traceback)
File "/usr/lib/python2.7/dist-packages/setuptools/sandbox.py", line 166, in save_modules
saved_exc.resume()
File "/usr/lib/python2.7/dist-packages/setuptools/sandbox.py", line 141, in resume
six.reraise(type, exc, self._tb)
File "/usr/lib/python2.7/dist-packages/setuptools/sandbox.py", line 154, in save_modules
yield saved
File "/usr/lib/python2.7/dist-packages/setuptools/sandbox.py", line 195, in setup_context
yield
File "/usr/lib/python2.7/dist-packages/setuptools/sandbox.py", line 243, in run_setup
DirectorySandbox(setup_dir).run(runner)
File "/usr/lib/python2.7/dist-packages/setuptools/sandbox.py", line 273, in run
return func()
File "/usr/lib/python2.7/dist-packages/setuptools/sandbox.py", line 242, in runner
_execfile(setup_script, ns)
File "/usr/lib/python2.7/dist-packages/setuptools/sandbox.py", line 46, in _execfile
exec(code, globals, locals)
File "/tmp/easy_install-12gS2d/xdg-2.0.0/setup.py", line 19, in <module>
REQS = ['PyYAML', 'requests', 'xdg;platform_system!="Windows"']
File "/tmp/easy_install-12gS2d/xdg-2.0.0/setup.py", line 12, in read_long_description
from setuptools import setup, find_packages
AttributeError: 'PosixPath' object has no attribute 'read_text'
The revision information stored in the GCMD vocabulary JSON files causes an empty dict to be present in the output of JSONVocabulary.get_list().
This causes the update of the database vocabularies in django-geo-spaas to fail.
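A minimal workaround sketch, filtering out the empty dict before consumers like django-geo-spaas see the list (the function name is illustrative):

```python
def drop_empty(entries):
    """Remove empty dicts (such as the revision header) from a keyword
    list, leaving the actual vocabulary entries untouched."""
    return [entry for entry in entries if entry]  # {} is falsy, so it is dropped
```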
There seems to be a Python 3 issue with byte strings in the response data.
Right now, update_all_vocabularies() is run every time the package is installed.
If something goes wrong during the execution, the installation of the package fails.
This kind of failure is often caused by a change in the source of one of the vocabularies, which can usually be solved quite easily by changing the version of the source that is used. But if the installation fails, it is impossible to even try that.
There would be several solutions to fix that, for example running update_all_vocabularies() on import instead of on install (only if there are no local files present, to avoid calling it every time).
What do you think @akorosov?