
esgf / esgf-pyclient

Search client for the ESGF Search API

Home Page: https://esgf-pyclient.readthedocs.io/en/latest/

License: BSD 3-Clause "New" or "Revised" License

Python 80.88% Makefile 1.91% Jupyter Notebook 17.21%
Topics: esgf, opendap, search, logon

esgf-pyclient's People

Contributors

agoodm, agstephens, alaniwi, bouweandela, cehbrecht, laliberte, larsbuntemeyer, mattben, navass11, pchengi2, philipkershaw, soay, stephenpascoe, valeriupredoi


esgf-pyclient's Issues

500 server error with wildcard facets

I am trying to get started with some simple queries and I noticed that if I don't give a value for "facets" I get a 500 Server Error:

from pyesgf.search import SearchConnection
conn = SearchConnection('https://esgf-node.llnl.gov/esg-search/', distrib=True)
ctx = conn.new_context(variable='tas', time_frequency='mon')
ctx.hit_count
...
HTTPError: 500 Server Error: 500 for url: https://esgf-node.llnl.gov/esg-search/search?format=application%2Fsolr%2Bjson&limit=0&distrib=false&type=Dataset&variable=tas&time_frequency=mon&facets=%2A

But if I set a value for facets (e.g., ctx = conn.new_context(variable='tas', time_frequency='mon', facets='null')), the search is returned successfully.

I think %2A, which appears to be the default value for facets, should be interpreted as a wildcard (*).

Is this expected behavior? Should I just specify some null value for facets (e.g., 0)?
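For reference, the facets=%2A in the failing URL is just the percent-encoded wildcard, so the server is receiving facets=*; a quick check in plain Python:

```python
from urllib.parse import quote, unquote

# "%2A" is the percent-encoding of "*", so facets=%2A means facets=*
assert unquote("%2A") == "*"
assert quote("*", safe="") == "%2A"
```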

`ignore_facet_check` search option appears to be broken

When I run this script:

import logging

import pyesgf.search


def example():
    logging.basicConfig(format="%(asctime)s [%(process)d] %(levelname)-8s "
                        "%(name)s,%(lineno)s\t%(message)s")
    pyesgf.search.connection.log.setLevel(logging.DEBUG)
    conn = pyesgf.search.SearchConnection(
        url='http://esgf-node.llnl.gov/esg-search')
    ctx = conn.new_context(project='CMIP5')
    ctx.search(ignore_facet_check=True)


if __name__ == '__main__':
    example()

the code crashes with the following output:

DEBUG:pyesgf.search.connection:Query dict is MultiDict([('format', 'application/solr+json'), ('limit', 0), ('distrib', 'true'), ('type', 'Dataset'), ('project', 'CMIP5')])
DEBUG:pyesgf.search.connection:Query request is http://esgf-node.llnl.gov/esg-search/search?format=application%2Fsolr%2Bjson&limit=0&distrib=true&type=Dataset&project=CMIP5
Traceback (most recent call last):
  File "/home/bandela/src/esmvalgroup/esmvalcore/try_filesearch.py", line 96, in <module>
    example()
  File "/home/bandela/src/esmvalgroup/esmvalcore/try_filesearch.py", line 92, in example
    ctx.search(ignore_facet_check=True)
  File "/home/bandela/conda/envs/esmvaltool/lib/python3.9/site-packages/pyesgf/search/context.py", line 126, in search
    sc.__update_counts(ignore_facet_check=ignore_facet_check)
  File "/home/bandela/conda/envs/esmvaltool/lib/python3.9/site-packages/pyesgf/search/context.py", line 207, in __update_counts
    for facet, counts in (list(response['facet_counts']['facet_fields'].items())):
KeyError: 'facet_counts'
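A hedged sketch of one possible defensive fix (not the project's actual patch): when ignore_facet_check is set, __update_counts could fall back to an empty mapping instead of assuming the Solr reply contains facet_counts:

```python
# Minimal stand-in for a Solr JSON reply that has no "facet_counts" key
response = {"response": {"numFound": 42, "docs": []}}

# Defensive lookup: default to an empty dict instead of raising KeyError
facet_fields = response.get("facet_counts", {}).get("facet_fields", {})
for facet, counts in facet_fields.items():
    print(facet, counts)  # never reached for this response

assert facet_fields == {}
```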

issue with SHARD_REXP

Hi,

I'm trying to use the module but every time I do a distributed search I bump into this error:

pyesgf.search.exceptions.EsgfSearchException: Shard spec esgf-node.jpl.nasa.gov/solr/datasets not recognised

SHARD_REXP = r'(?P<host>.+?):(?P<port>\d+)/solr(?P<suffix>.*)'
changing this to
SHARD_REXP = r'(?P<host>.+?)(?::(?P<port>\d+))?/solr(?P<suffix>.*)'

in consts.py fixes it

Thanks,

Paola
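The named-group syntax was stripped when this issue was rendered, so the patterns above are incomplete; the sketch below reconstructs the likely intent (group names host/port/suffix are assumptions) and shows why a mandatory :port fails on this shard spec:

```python
import re

# Reconstructed patterns (group names assumed); the old one requires ":<port>"
OLD = r'(?P<host>.+?):(?P<port>\d+)/solr(?P<suffix>.*)'
NEW = r'(?P<host>.+?)(?::(?P<port>\d+))?/solr(?P<suffix>.*)'

spec = 'esgf-node.jpl.nasa.gov/solr/datasets'
assert re.match(OLD, spec) is None           # no port -> old pattern rejects it
m = re.match(NEW, spec)                      # optional port -> matches
assert m.group('host') == 'esgf-node.jpl.nasa.gov'
assert m.group('suffix') == '/datasets'
```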

Import trial of `MyProxyClient` in `pyesgf/logon.py` outputs misleading error and incompatibility with `cryptography` from Anaconda `main` channel

@agstephens et al, here are two issues in one. I chose to open a single issue instead of two because the two things are related; in fact, one is the cause of the other:

  • main issue: the cryptography dep from the Anaconda main channel is incompatible with noarch/esgf-pyclient=0.3.1 from conda-forge, and I am afraid that exact one gets pulled in when installing esgf-pyclient from conda-forge; the incompatibility throws an openssl-related error from within cryptography:
from myproxy.client import MyProxyClient

results in

Traceback (most recent call last):
  File "/home/valeriu/ESMValCore/testimp.py", line 1, in <module>
    from myproxy.client import MyProxyClient
  File "/home/valeriu/miniconda3/envs/experimental-all-conda/lib/python3.10/site-packages/myproxy/client/__init__.py", line 42, in <module>
    from OpenSSL import crypto, SSL
  File "/home/valeriu/miniconda3/envs/experimental-all-conda/lib/python3.10/site-packages/OpenSSL/__init__.py", line 8, in <module>
    from OpenSSL import crypto, SSL
  File "/home/valeriu/miniconda3/envs/experimental-all-conda/lib/python3.10/site-packages/OpenSSL/crypto.py", line 11, in <module>
    from OpenSSL._util import (
  File "/home/valeriu/miniconda3/envs/experimental-all-conda/lib/python3.10/site-packages/OpenSSL/_util.py", line 5, in <module>
    from cryptography.hazmat.bindings.openssl.binding import Binding
  File "/home/valeriu/miniconda3/envs/experimental-all-conda/lib/python3.10/site-packages/cryptography/hazmat/bindings/openssl/binding.py", line 14, in <module>
    from cryptography.hazmat.bindings._openssl import ffi, lib
ImportError: libssl.so.1.1: cannot open shared object file: No such file or directory

-> now this is not your fault, since myproxyclient is the package that is at fault here, but this is a heads-up; maybe you can take it up with the myproxyclient guys

  • the secondary issue, which I suggest you guys fix in yer code pyesgf/logon.py is related to the main issue but masks it at import trial:
try:
    from myproxy.client import MyProxyClient
    import OpenSSL
    _has_myproxy = True
except (ImportError, SyntaxError):
    _has_myproxy = False

please catch the import exception and print it when _has_myproxy = False, since that way you let the user know what the actual offending import is! 🍺
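A minimal sketch of the requested behaviour (the helper name and warning message are hypothetical, not pyesgf code): record why the optional import failed instead of silently setting a flag:

```python
import importlib
import warnings

def optional_import(module_name):
    """Try an optional import; on failure, surface the real reason."""
    try:
        return importlib.import_module(module_name)
    except (ImportError, SyntaxError) as exc:
        # Tell the user which import actually failed and why
        warnings.warn(f"optional dependency {module_name!r} unavailable: {exc}")
        return None

# A module that certainly does not exist, standing in for a broken myproxy
mod = optional_import("definitely_not_installed_xyz")
assert mod is None
```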

Check tests work on master, and assess PR: 68

Look into test failures and accept PR if we get it all working.

#68

You can see from the PR, some of the unit tests are failing when running as GitHub Actions (i.e. continuous integration on github). You can click and review the details to see which tests are failing. It would be worth checking out master first and running the tests on that to check whether there is a difference in the two branches - or whether the tests are failing on both.

Should Attribute Service calls work?

Hi @philipkershaw, another question for you.

We have a call to the Attribute Service as defined in these tests:

class TestATS(TestCase):
    @pytest.mark.xfail(reason='This test does not work anymore.')
    def test_ceda_ats(self):
        service = AttributeService(CEDA_NODE.ats_url, 'esgf-pyclient')
        fn, ln = 'Ag', 'Stephens'
        resp = service.send_request(OPENID, ['urn:esg:first:name',
                                             'urn:esg:last:name'])

        assert resp.get_subject() == OPENID

        attrs = resp.get_attributes()
        assert attrs['urn:esg:first:name'] == fn
        assert attrs['urn:esg:last:name'] == ln

    @pytest.mark.xfail(reason='This test does not work anymore.')
    def test_multi_attribute(self):
        service = AttributeService(CEDA_NODE.ats_url, 'esgf-pyclient')

        resp = service.send_request(OPENID, ['CMIP5 Research'])

        attrs = resp.get_attributes()
        assert list(sorted(attrs['CMIP5 Research'])) == ['default', 'user']

Both tests fail at present. Do you know if this is because something has changed in the attribute service or whether there is just a configuration problem? These tests live in:

https://github.com/ESGF/esgf-pyclient/blob/master/pyesgf/test/test_ats.py

adding a download option

Hi Ag,

first of all I should say that I'm a data "expert" for a climate science centre in Australia, and I've been using the esgf-pyclient by embedding it into a python interface to our local collection at NCI, so our users can compare in fine detail what's available locally with what is online. Claire Trenham from NCI forwarded me an e-mail conversation from the esgf-devel mailing list regarding the need for a "search and download" tool. Both synda and pyesgf were mentioned; I've never used synda, though I will have a go now. pyesgf was good for me because it gives a lot of detail, which I found necessary for comparing files when the version information is missing.
I'm really interested in any progress on this discussion. In fact, when I chose pyesgf I assumed I could use it to download the files too. I was surprised because it takes care of the certificates with the logon function and you can very easily extract the file download urls and checksums, but then there's no download option. I actually tried to add one myself, but I didn't have time to do it properly and set it aside. Our python module was developed to fill a hole in the services we have (or rather don't have) available.
So if you decide to add this enhancement I'll be happy to be a tester, though I'll be away for the next two months, back in mid-July.

Regards,

Paola

ssl verification error when using logon_with_openid

When using the logon_with_openid method to retrieve a certificate from a myproxy server one might get an ssl verification error. It would be nice if we could make ssl verification optional in this case.

See:
https://github.com/ESGF/esgf-pyclient/blob/master/pyesgf/logon.py#L196

Could be replaced for example with requests:

import requests
from io import BytesIO
from xml.etree import ElementTree as etree

response = requests.get(openid, verify=False)
xml = etree.parse(BytesIO(response.content))

Can be changed after merge of PR #14.

lm.logon timeout

I'm attempting to use esgf-pyclient to help download some data, but am stuck logging on.

I have an OpenID account with CEDA. Which is https://ceda.ac.uk/openid/Thomas.Crocker
My username to login at CEDA is tcrocker

All my attempts to connect all lead to: TimeoutError: [Errno 110] Connection timed out

I have tried:

>>> OPENID = 'https://ceda.ac.uk/openid/Thomas.Crocker'

>>> lm.logon_with_openid(openid=OPENID, password=None, bootstrap=True)
Enter myproxy username: tcrocker
Enter password for tcrocker: 

and

>>> proxyhost = 'esgf-index1.ceda.ac.uk'

>>> lm.logon(hostname=proxyhost, interactive=True, bootstrap=True)
Enter myproxy username: tcrocker
Enter password for tcrocker: 

and the same as above but with proxyhost set to esgf.ceda.ac.uk

Can anyone advise how to get this to work? I am based at the UK Met Office so I wonder if the problem could be related to our network firewall in some way?

Access failure when trying to download OPeNDAP data from a specific node

Hi,
I am trying to download cordex data sets. I have created an account on esg-dn1.nsc.liu.se data node.
My openID is:
'https://esg-dn1.nsc.liu.se/esgf-idp/openid/XXXXX'

I use pyclient to download series of simulations for a location (lat, lon). My search is sucessful, how ever, why I try to get the download, I get a Access Failure message.
could you please let me what part of my scripts I am doing wrong ?
I successfully logon using my openid and password.

here is the script:

import os

import xarray as xr
from pyesgf.search import SearchConnection

conn = SearchConnection('https://esg-dn1.nsc.liu.se/esg-search', distrib=True)
ctx = conn.new_context(
    project='CORDEX',
    variable=['pr'],
    time_frequency='3hr',
    domain='MNA-44',
    data_node='esg-dn1.nsc.liu.se'
)
ctx.hit_count
rslts = ctx.search()

urls = []  # get the urls here
for r in rslts:
    files = r.file_context().search()
    for file in files:
        if file.opendap_url is not None:
            urls.append(file.opendap_url)

lat_v = 29.639659
lon_v = 52.569935
for url in urls:
    path, filename = os.path.split(url)
    print('downloading {}'.format(filename))
    ds = xr.open_dataset(url)

    data = ds['pr']  # <-- this line gives the Access Failure error

    da = data.sel(rlat=lat_v, rlon=lon_v, method='nearest')
    da.to_netcdf(filename)
    print('saved file {}'.format(filename))

ERROR MESSAGE:

OSError: [Errno -77] NetCDF: Access failure: b'http://esg-dn1.nsc.liu.se/thredds/dodsC/esg_dataroot3/cordexdata/cordex/output/MNA-44/SMHI/CNRM-CERFACS-CNRM-CM5/rcp85/r1i1p1/SMHI-RCA4/v1/3hr/pr/v20180109/pr_MNA-44_CNRM-CERFACS-CNRM-CM5_rcp85_r1i1p1_SMHI-RCA4_v1_3hr_200601010130-200612312230.nc'

Batch downloading >1000 files

Hi,

I'm wondering if there is a supported method for downloading files in batches of more than 1000 using this tool. I'm running into an issue where, if the file count exceeds 1000, I cannot download the entire set. For example:

Warning! The total number of files was 3222 but this script will only process 1000.
Script created for 1000 file(s)
(The count won't match if you manually edit this file!)

I would like to know if there's a way of either increasing this limit or creating multiple wget scripts that can then be run in succession.
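One workaround, sketched under the assumption that you collect the download URLs yourself (helper name hypothetical): split the URL list into chunks of at most 1000 and write one wget script per chunk:

```python
def chunked(seq, size=1000):
    """Split a sequence into consecutive chunks of at most `size` items."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]

# 3222 files, as in the warning above
urls = ["https://example.invalid/file{}.nc".format(i) for i in range(3222)]
batches = chunked(urls)
assert len(batches) == 4
assert len(batches[0]) == 1000 and len(batches[-1]) == 222
```

Each batch could then be written into its own wget script and run in succession.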

Logon Help

@agstephens and anyone else that sees this: can I get some help getting the LogonManager to work again? I'm getting

OpenSSL.SSL.Error: [('SSL routines', 'SSL3_GET_SERVER_CERTIFICATE', 'certificate verify failed')]

here are the steps to reproduce:

cd esgf-pyclient
virtualenv env
source env/bin/activate
pip install MyProxyClient
python setup.py install

I removed my ~/.esg, then went to pcmdi9, logged in, searched, clicked a wget script and ran it, which generated a new clean ~/.esg directory.

I wrote a simple example, test-log-on.py, based on the example given:

import pyesgf.logon
lm = pyesgf.logon.LogonManager()
lm.logoff()
lm.is_logged_on()
lm.logon_with_openid('https://pcmdi9.llnl.gov/esgf-idp/openid/mattben', 'PassWord')
lm.is_logged_on() 

this is the output

(env)harris112@harris112ml1:[esgf-pyclient]:[master]:[15979]> python test-log-on.py 
Traceback (most recent call last):
  File "test-log-on.py", line 8, in <module>
    lm.logon_with_openid('https://pcmdi9.llnl.gov/esgf-idp/openid/mattben', 'PassWord')
  File "/Users/harris112/projects/ESGF/esgf-pyclient/pyesgf/logon.py", line 140, in logon_with_openid
    interactive=interactive)
  File "/Users/harris112/projects/ESGF/esgf-pyclient/pyesgf/logon.py", line 176, in logon
    bootstrap=bootstrap, updateTrustRoots=update_trustroots)
  File "/Users/harris112/projects/ESGF/esgf-pyclient/env/lib/python2.7/site-packages/myproxy/client.py", line 1412, in logon
    **getTrustRootsKw)
  File "/Users/harris112/projects/ESGF/esgf-pyclient/env/lib/python2.7/site-packages/myproxy/client.py", line 1564, in getTrustRoots
    conn.write('0')
  File "/Users/harris112/projects/ESGF/esgf-pyclient/env/lib/python2.7/site-packages/OpenSSL/SSL.py", line 1271, in send
    self._raise_ssl_error(self._ssl, result)
  File "/Users/harris112/projects/ESGF/esgf-pyclient/env/lib/python2.7/site-packages/OpenSSL/SSL.py", line 1187, in _raise_ssl_error
    _raise_current_error()
  File "/Users/harris112/projects/ESGF/esgf-pyclient/env/lib/python2.7/site-packages/OpenSSL/_util.py", line 48, in exception_from_error_queue
    raise exception_type(errors)
OpenSSL.SSL.Error: [('SSL routines', 'SSL3_GET_SERVER_CERTIFICATE', 'certificate verify failed')]

Am I missing something? Any help would be appreciated.

Sorry for the spamming: @LucaCinquini @ncaripsl @prashanth Dwarakanath @philipkershaw @sashakames

openssl.ssl.error ('ssl routines' 'ssl3_get_record' 'wrong version number')

I am running a download script using my OpenID for CORDEX data. It is a script that I have been using successfully on different computers (Windows, macOS and Linux). I am trying to use it on other computers with first_time=True:
lm = LogonManager()
lm.logon_with_openid(openid=openid, password=password, bootstrap=first_time)

and the error:
  File "/home/lloarca/climate_change/cordex/download.py", line 42, in search
    lm.logon_with_openid(openid=openid, password=password, bootstrap=first_time)
  File "/opt/anaconda3/lib/python3.7/site-packages/pyesgf/logon.py", line 149, in logon_with_openid
    interactive=interactive)
  File "/opt/anaconda3/lib/python3.7/site-packages/pyesgf/logon.py", line 185, in logon
    updateTrustRoots=update_trustroots)
  File "/opt/anaconda3/lib/python3.7/site-packages/myproxy/client.py", line 1448, in logon
    **getTrustRootsKw)
  File "/opt/anaconda3/lib/python3.7/site-packages/myproxy/client.py", line 1605, in getTrustRoots
    conn.write(self.__class__.GLOBUS_INIT_MSG)
  File "/opt/anaconda3/lib/python3.7/site-packages/OpenSSL/SSL.py", line 1757, in send
    self._raise_ssl_error(self._ssl, result)
  File "/opt/anaconda3/lib/python3.7/site-packages/OpenSSL/SSL.py", line 1671, in _raise_ssl_error
    _raise_current_error()
  File "/opt/anaconda3/lib/python3.7/site-packages/OpenSSL/_util.py", line 54, in exception_from_error_queue
    raise exception_type(errors)
OpenSSL.SSL.Error: [('SSL routines', 'ssl3_get_record', 'wrong version number')]

I have openssl version 1.0.1k on one of them and 1.1.0i on another one. On both I get the same error. Any thoughts?

.dods_cookies block the data access

I'm using esgf-pyclient to access data from ESGF. When a user who is not registered for CORDEX access tries to access CORDEX data, cookies are stored in the .dods_cookies file. On the next request, .dods_cookies blocks data access even for registered users. I would like to know more about the .dods_cookies file; please could you advise?

Thank you.
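A hedged workaround sketch (not official guidance): deleting the stale cookie jar before the next request forces re-authentication. The helper below demonstrates the idea on a throwaway file standing in for ~/.dods_cookies:

```python
import os
import tempfile

def clear_stale_cookies(path):
    """Delete the cookie jar if present so the next request re-authenticates."""
    if os.path.exists(path):
        os.remove(path)
        return True
    return False

# Demo on a temporary file rather than the real ~/.dods_cookies
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.close()
assert clear_stale_cookies(tmp.name) is True   # removed on first call
assert clear_stale_cookies(tmp.name) is False  # already gone
```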

SSL error

I get an SSL error when I try to log in with my OpenID. It seems that there is a mismatch between the OpenSSL versions in my environment and on the LLNL node. Would you be able to tell me which version of pyopenssl ESGF Pyclient is built against?

Is there a way to search the entire ESGF?

Different search urls provide very different results, is there some way to search all data available on ESGF?

For example:

>>> pyesgf.search.SearchConnection(url='https://esgf-data.dkrz.de/esg-search', distrib=True).new_context().facet_counts['project']
{'wind': 1, 'uerra': 2, 'tracmip': 6767, 'reklies-index': 28792, 'obs4MIPs': 2, 'monthlyfc': 2710, 'input4mips': 5832, 'hiresireland': 66, 'TEST': 4, 'TAMIP': 192, 'PMIP3': 16, 'MiKlip': 5568, 'MPI-GE': 55111, 'LUCID': 112, 'CORDEX-Reklies': 7017, 'CORDEX-ESD': 1370, 'CORDEX': 67908, 'CMIP6': 874263, 'CMIP5': 53725}
>>> pyesgf.search.SearchConnection(url='http://esgf-index1.ceda.ac.uk/esg-search', distrib=True).new_context().facet_counts['project']
{'specs': 427949, 'obs4MIPs': 27, 'eucleia': 1921, 'clipc': 104, 'TAMIP': 640, 'PMIP3': 10, 'GeoMIP': 233, 'CORDEX': 5880, 'CMIP5': 48143}
>>> pyesgf.search.SearchConnection(url='http://esgf-node.llnl.gov/esg-search', distrib=True).new_context().facet_counts['project']
{'wind': 1, 'uerra': 2, 'tracmip': 6767, 'specs': 446693, 'reklies-index': 28792, 'psipps': 1, 'primavera': 6400, 'obs4MIPs': 218, 'ncpp2013': 17, 'monthlyfc': 2710, 'input4mips': 11492, 'input4MIPs': 201, 'hiresireland': 66, 'eucleia': 1921, 'e3sm-supplement': 53, 'e3sm': 813, 'cmip3': 71, 'clipc': 114, 'cc4e': 497, 'c3se': 184, 'c3s-cmip5-adjust': 188, 'ana4MIPs': 7, 'TEST': 7, 'TAMIP': 1536, 'PMIP3': 361, 'NEXGDDP': 3, 'NEX': 10, 'NARR_Hydrology': 85, 'MiKlip': 5568, 'MPI-GE': 55111, 'LUCID': 318, 'ISIMIP3b': 550, 'ISIMIP3a': 111, 'ISIMIP2b': 95963, 'ISIMIP2a': 13803, 'ISIMIP2 Phase a': 288, 'ISI-MIP Fast Track': 856, 'GeoMIP': 757, 'EUCLIPSE': 41, 'CREATE-IP': 110, 'CORDEX-Reklies': 7017, 'CORDEX-ESD': 1370, 'CORDEX-Adjust': 1221, 'CORDEX': 183980, 'CMIP6': 11174039, 'CMIP5': 206811, 'CMIP3': 29331, 'CDAT-sample': 1, 'BioClim': 2, 'ACME': 23}

Search API documentation not available on readthedocs

The Search API documentation is not displaying on readthedocs: https://esgf-pyclient.readthedocs.io/en/latest/api.html#search-api.

It looks like this happens because the build fails to import the package's dependencies: when I create a conda environment from docs/environment.yml and run make html, the result is:

sphinx-build -b html -d build/doctrees   source build/html
Running Sphinx v3.2.1
making output directory... done
building [mo]: targets for 0 po files that are out of date
building [html]: targets for 13 source files that are out of date
updating environment: [new config] 13 added, 0 changed, 0 removed
reading sources... [100%] quickstart                                                                                                                                                                                                           
WARNING: autodoc: failed to import module 'search' from module 'pyesgf'; the following exception was raised:
No module named 'requests_cache'
WARNING: autodoc: failed to import module 'search.connection' from module 'pyesgf'; the following exception was raised:
No module named 'requests_cache'
WARNING: autodoc: failed to import module 'search.context' from module 'pyesgf'; the following exception was raised:
No module named 'requests_cache'
WARNING: autodoc: failed to import module 'search.results' from module 'pyesgf'; the following exception was raised:
No module named 'requests_cache'
looking for now-outdated files... none found
pickling environment... done
checking consistency... done
preparing documents... done
writing output... [100%] quickstart                                                                                                                                                                                                            
generating indices...  genindex py-modindexdone
copying notebooks ... [100%] notebooks/examples/search.ipynb                                                                                                                                                                                   
highlighting module code... [100%] pyesgf.logon                                                                                                                                                                                                
writing additional pages...  searchdone
copying static files... ... done
copying extra files... done
dumping search index in English (code: en)... done
dumping object inventory... done
build succeeded, 4 warnings.

The HTML pages are in build/html.

Build finished. The HTML pages are in build/html
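A hedged guess at the fix: docs/environment.yml (contents assumed, not verified against the repo) would need to include the runtime dependencies that autodoc imports, e.g.:

```yaml
# Sketch only: exact names/pins in docs/environment.yml are assumptions
dependencies:
  - sphinx
  - requests
  - requests_cache
```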

Facets warning with aggregation_context(): unexpected keyword

Hello,

According to the documentation, we always need to supply facets to the search_context(). If not, we get the following warning:

Warning - defaulting to search with facets=*

This behavior is kept for backward-compatibility, but ESGF indexes might not
successfully perform a distributed search when this option is used, so some
results may be missing.  For full results, it is recommended to pass a list of
facets of interest when instantiating a context object.  For example,

      ctx = conn.new_context(facets='project,experiment_id')

Only the facets that you specify will be present in the facets_counts dictionary.

This warning is displayed when a distributed search is performed while using the
facets=* default, a maximum of once per context object.  To suppress this warning,
set the environment variable ESGF_PYCLIENT_NO_FACETS_STAR_WARNING to any value
or explicitly use  conn.new_context(facets='*')

-------------------------------------------------------------------------------

However, the problem is that this warning also appears for aggregation_context(), even though aggregation_context() does not take facets as a parameter. Even if I create a new_context() with facets and then create an aggregation_context() from a search result, I still get the facets warning. For example, with this piece of code:

facets = "source_id"
ctx = conn.new_context(
    project='CMIP6',
    experiment_id="historical",
    facets=facets
)
result = ctx.search()[0]
agg_ctx = result.aggregation_context().search()

Is this a problem? Will this lead to an incomplete distributed search or is it something I do not need to worry about?

Thank you in advance!

Error with requests_cache dependency

Mark at the MO reported this error when running this code:

from pyesgf.search import SearchConnection
conn = SearchConnection('http://esgf-index1.ceda.ac.uk/esg-search', distrib=True)

ctx = conn.new_context(project='CMIP5', query='humidity')
ctx.hit_count

Error:

/usr/local/lib/python3.7/dist-packages/pyesgf/search/connection.py in open(self)
     96     def open(self):
     97         if (isinstance(self._passed_session, requests.Session) or isinstance(
---> 98                 self._passed_session, requests_cache.core.CachedSession)):
     99             self.session = self._passed_session
    100         else:

A quick search suggests that the API inside the package has changed.

An older version worked fine:

requests_cache-0.4.1

The new version that failed was:

requests_cache-0.6.4

Needs further investigation.
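A compatibility sketch (helper name hypothetical): newer requests_cache releases dropped the core submodule, so the class can be resolved from whichever location actually exists:

```python
import importlib

def resolve_cached_session():
    """Return CachedSession from whichever requests_cache layout is installed,
    or None if requests_cache is not available at all."""
    for modname in ("requests_cache", "requests_cache.core"):
        try:
            return getattr(importlib.import_module(modname), "CachedSession")
        except (ImportError, AttributeError):
            continue
    return None

cls = resolve_cached_session()
assert cls is None or isinstance(cls, type)
```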

Remove unused code parts?

Some code parts are never used or mentioned in the docs:

  • pyesgf/node.py
  • pyesgf/manifest.py
  • pyesgf/security/ats.py

Should we remove them?

[Errno -77] NetCDF: Access failure

I don't know exactly what is wrong, but I am trying to use one of the examples and it does not work.
Here is the script that produces the result:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

#############
from pyesgf.logon import LogonManager
lm = LogonManager()
lm.logoff()
lm.is_logged_on()
password = ''
openId = 'https://esgf-data.dkrz.de/esgf-idp/openid/**'
lm.logon_with_openid(openId, password, bootstrap=True)
lm.is_logged_on()

import xarray as xr
url = 'http://esgf2.dkrz.de/thredds/fileServer/lta_dataroot/cmip5/output1/MIROC/MIROC5/rcp45/mon/aerosol/aero/r1i1p1/v20120514/wetss/wetss_aero_MIROC5_rcp45_r1i1p1_200601-210012.nc'
ds = xr.open_dataset(url, chunks={'time': 120})
print(ds)

MyProxyClient integration: response change to bytes causing issues

@philipkershaw: I've been testing the latest pyclient with Python 3 and I'm getting the following error from the test_logon.py tests:

       c = MyProxyClient(hostname=hostname, caCertDir=self.esgf_certs_dir)

        creds = c.logon(username, password,
                        bootstrap=bootstrap,
                        updateTrustRoots=update_trustroots)
        with open(self.esgf_credentials, 'w') as fh:
            for cred in creds:
>               fh.write(cred)
E               TypeError: write() argument must be str, not bytes

It looks like there has been a change in MyProxyClient: it now returns bytes rather than a string. I tried a simple fh.write(str(cred)) to fix it, but it didn't work. Any idea what might fix this?

NOTE: Seems to work fine with python2.7.
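One possible fix, sketched here rather than taken from the eventual patch: open the credentials file in binary mode and encode any str chunks, so both old (str) and new (bytes) MyProxyClient return types work:

```python
import os
import tempfile

def write_creds(path, creds):
    """Write credential chunks, tolerating both str and bytes."""
    with open(path, "wb") as fh:
        for cred in creds:
            if isinstance(cred, str):
                cred = cred.encode("utf-8")  # normalise str -> bytes
            fh.write(cred)

# Demo with dummy data in a temporary directory
path = os.path.join(tempfile.mkdtemp(), "credentials.pem")
write_creds(path, [b"-----BEGIN-----\n", "-----END-----\n"])
with open(path, "rb") as fh:
    assert fh.read() == b"-----BEGIN-----\n-----END-----\n"
```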

Dependency problem (version mismatch between pyesgf and the requests_cache library) leading to AttributeError: module 'requests_cache' has no attribute 'core'

Hi!
I pip installed pyesgf and its dependencies, but apparently the versions don't work well together: pyesgf (0.3.0) needs a different version of requests_cache than the one currently available on pip.

I tried this:

myinstance = 'CMIP6.AerChemMIP.BCC.BCC-ESM1.ssp370SST-lowNTCF.r1i1p1f1.Lmon.tsl.gn.v20190612'
conn = SearchConnection(index_search_url, distrib=False)
ctx = conn.new_context(project="CMIP6", instance_id=myinstance)
dset=ctx.search()

files=dset.file_context().search()
i=0
for file in files:
    i += 1
    print('%s : %s' % (i, file.json["instance_id"]))

And ran into this error:

Traceback (most recent call last):
  File "corr.py", line 21, in <module>
    dset=ctx.search()
  File "/home/.../venv3/lib/python3.6/site-packages/pyesgf/search/context.py", line 126, in search
    sc.__update_counts(ignore_facet_check=ignore_facet_check)
  File "/home/.../venv3/lib/python3.6/site-packages/pyesgf/search/context.py", line 206, in __update_counts
    response = self.connection.send_search(query_dict, limit=0)
  File "/home/.../venv3/lib/python3.6/site-packages/pyesgf/search/connection.py", line 156, in send_search
    self.open()
  File "/home/.../venv3/lib/python3.6/site-packages/pyesgf/search/connection.py", line 98, in open
    self._passed_session, requests_cache.core.CachedSession)):
AttributeError: module 'requests_cache' has no attribute 'core'

Versions/environment:

Python 3.6.8 (default, Nov 16 2020, 16:55:22) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-44)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyesgf
>>> import requests
>>> requests.__version__
'2.26.0'
>>> pyesgf.__version__
'0.3.0'
>>> 

If you need any more details, don't hesitate to contact me (or also @cehbrecht ). Thanks!

Trouble on to_netcdf()

Hi there,

I'm currently trying to download subsets of some CMIP6 model data using the esgf-pyclient, following the examples at https://esgf-pyclient.readthedocs.io/en/latest/notebooks/demo/subset-cmip6.html. It works mostly great; two minor adjustments I had to make were setting 'decode_cf' to False while opening with xarray, and the spatial subsetting (da.sel() doesn't work, probably because my data has multidimensional coordinates, but I managed to find a workaround).

However, once I get to the point of extracting it to .nc, a simple da.to_netcdf('test.nc') returns an "AttributeError: NetCDF: String match to name in use" error. I then tried setting it up as netcdf3_classic (as a test):

da.to_netcdf('test.nc', 'w', 'NETCDF3_CLASSIC')

and it does initially run and creates a file, but it breaks down at some point, which doesn't surprise me much, as netcdf3_classic is not well prepared to handle files over 2GB. Then I get the errors:
...
RuntimeError: NetCDF: Operation not allowed in define mode... During handling of the above exception, another exception occurred: ... RuntimeError: NetCDF: One or more variable sizes violate format constraints.

Opening the created file makes no sense, as the variable of interest comes up full of '--' (no data, yet the file takes up over 2GB). I've also tried with files I know are under 2GB, just as a test, and the error I get is "RuntimeError: NetCDF: Access failure" (when using to_netcdf; the files are also created but make no sense).

I've looked up through the data before setting up to extract to nc. I'm still learning programming, python and handling netcdf files but I manage to understand a bit. Using dask and data access protocols such as opendap(pydap/netcdf4) however, it's still a bit cloudy for me. I was able to access the references variables values (time, lat, lon, levels) but once I get to my variable of interest values, it just also breaks down, examples below:

vo = da.variables["vo"][:,:,:,:].values # RuntimeError: NetCDF: Access failure
vo = subset.variables['vo'][1,1,:,:].values # this does work, but then I'm unable to access all the values of my variable to construct my whole file

I should note that if I don't set decode_cf to False, the error while trying to create the nc is "AttributeError: 'numpy.float64' object has no attribute 'year'" (just in case someone runs into it too).

Any thoughts?
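For anyone hitting this: NETCDF3_CLASSIC has a hard per-variable size limit of roughly 2 GiB, so the constraint-violation errors above are expected for large variables. A minimal sketch of a guard that falls back to NETCDF4, which has no such limit (the limit constant and `pick_format` helper are my own illustration, not part of xarray):

```python
# Sketch: NETCDF3_CLASSIC cannot store a variable larger than roughly
# 2 GiB, so check the expected size first and fall back to NETCDF4
# (HDF5-based), which has no such limit.

NETCDF3_VAR_LIMIT = 2**31 - 4  # approximate classic-format per-variable limit in bytes

def fits_netcdf3(nbytes):
    """Return True if a variable of this size fits the classic format."""
    return nbytes <= NETCDF3_VAR_LIMIT

def pick_format(nbytes):
    """Choose an output format based on the variable size in bytes."""
    return "NETCDF3_CLASSIC" if fits_netcdf3(nbytes) else "NETCDF4"

# Hypothetical usage with the 'da' object from the issue:
# fmt = pick_format(da.nbytes)
# da.to_netcdf("test.nc", format=fmt)
```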

Default batch size

The default batch size of 50, set in search/consts.py : DEFAULT_BATCH_SIZE = 50, makes the response slow. Adjusting it to 5000 gives a much faster response.
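The speedup comes from fewer HTTP round trips when paging through results. A quick sketch (`requests_needed` is illustrative; the `batch_size` keyword to `search()` is the supported way to override the default without editing consts.py):

```python
import math

def requests_needed(hit_count, batch_size):
    """Number of HTTP round trips to page through all hits."""
    return math.ceil(hit_count / batch_size)

# With the default batch size of 50, paging through 100,000 results
# takes 2,000 requests; with 5,000 it takes only 20:
print(requests_needed(100_000, 50))    # 2000
print(requests_needed(100_000, 5000))  # 20

# Rather than editing search/consts.py, the batch size can be passed
# per call:
# results = ctx.search(batch_size=5000)
```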

add python 3 support

Are there any plans to port this code to Python 3? If you want, I could do it, but it appears that the authentication might not be easily ported, since it relies on MyProxyClient, which is also fairly out of date.

batch_size setting alters the number of results returned by search

I've found an example of a search using pyesgf where changing the batch size changes the number of results although the documentation says: "The batch_size argument does not affect the final result but may affect the speed of the response."

Here's a test that demonstrates the problem:

import unittest

from pyesgf.search import SearchConnection

class TestBatchSize(unittest.TestCase):

    def test_batch_size_has_no_impact_on_results(self):
        conn = SearchConnection(
            'https://esgf-index1.ceda.ac.uk/esg-search', distrib=True)
        ctx = conn.new_context(
            mip_era='CMIP6', institution_id='CCCma', 
            experiment_id='pdSST-pdSIC', table_id='Amon', variable_id='ua')
        results = ctx.search(batch_size=50)
        ids_batch_size_50 = sorted(results, key=lambda x: x.dataset_id)

        ctx = conn.new_context(
            mip_era='CMIP6', institution_id='CCCma', 
            experiment_id='pdSST-pdSIC', table_id='Amon', variable_id='ua')
        results = ctx.search(batch_size=100)
        ids_batch_size_100 = sorted(results, key=lambda x: x.dataset_id)

        self.assertEqual(len(ids_batch_size_50), len(ids_batch_size_100))


if __name__ == '__main__':
    unittest.main()

Not matching all the expected files

The following python code reports finding 38 files, but a query using the web interface finds a dataset with 86 files. Why isn't the python version finding all the files?

from pyesgf.search import SearchConnection

conn = SearchConnection(
    "https://esgf.ceda.ac.uk/esg-search", distrib=True)
ctx = conn.new_context(
    mip_era="CMIP6", source_id="EC-Earth3", experiment_id="ssp370",
    member_id="r1i1p1f1", table_id="Amon", variable_id="pr",
    latest=True)
results = ctx.search(batch_size=1000)
files = results[0].file_context().search()
print(len(files))
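One thing to check: `results[0]` is just the first dataset record, and with `distrib=True` the same dataset can appear as multiple replica records with differing file counts. Summing files over all returned datasets may reconcile the numbers; a small helper (illustrative, assuming each item behaves like a pyesgf DatasetResult):

```python
def total_file_count(datasets):
    """Sum the number of files across all dataset results.

    Each item must provide a file_context() whose search() returns
    a sequence of file results (as pyesgf's DatasetResult does).
    """
    return sum(len(ds.file_context().search()) for ds in datasets)

# Hypothetical usage with the 'results' from the snippet above:
# print(total_file_count(results))
```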

add "facets" keyword argument to DatasetResult.file_context

The DatasetResult.file_context function (see results.py) doesn't allow a facets keyword argument, but we might want to set the facets property of the FileSearchContext object that is returned (especially in order to avoid the default facets='*').

Currently we have to monkey-patch it:

fc = result.file_context()
fc.facets = 'project'

but it would be nice to be able to do:

fc = result.file_context(facets='project')

Should be a simple fix to just add the argument and pass it through.
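Until that lands, the monkey-patch can at least be wrapped in a tiny helper so call sites stay tidy (a sketch, not part of the library):

```python
def file_context_with_facets(result, facets):
    """Workaround until DatasetResult.file_context accepts a facets kwarg.

    'result' is assumed to be a pyesgf DatasetResult; we just set the
    attribute after construction, exactly like the monkey-patch above.
    """
    fc = result.file_context()
    fc.facets = facets
    return fc

# fc = file_context_with_facets(result, 'project')
```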

ESGF search API issue

Dear team,

I'm using the ESGF API to list files. I gave the following parameters as input:
Project: Cordex
Institute: MPI-CSI
Time Freq: day
Ensemble: r1i1p1
Domain: WAS-44i
Driving Model: MPI-M-MPI-ESM-LR
OpenID:
Password:
esgf-node: https://esgf-node.ipsl.upmc.fr/esg-search

I'm getting error like

"NetCDF: Access failure: http://esg-cccr.tropmet.res.in/thredds/dodsC/esg_dataroot1/cordex_noncommercial/cordex/output/WAS-44/IITM/CCCma-CanESM2/historical/r1i1p1/IITM-RegCM4-4/v5/day/pr/v20160824/pr_WAS-44_CCCma-CanESM2_historical_r1i1p1_IITM-RegCM4-4_v5_day_19510101-19551231.nc"

The above error shows the ESGF node "http://esg-cccr.tropmet.res.in" and driving model "CCCma-CanESM2" instead of the input data node and driving model. A few days ago it was working fine, though.

I would appreciate your help in solving this issue.

Thank you.

Some search contexts are slow - due to facet investigation for each "__update_counts()" call

The internal method "__update_counts()" in "context.py" always adds {"facets": "*"} to the query and, behind the scenes, makes a call to refresh the hit count and the available facets. This typically takes 2 seconds to complete.

If you are looping through lots of different contexts, this call makes things slow. Here are two example URLs to demonstrate:

With facet counts (slow):

https://esgf-index1.ceda.ac.uk/esg-search/search?distrib=false&limit=0&format=application%2Fsolr%2Bjson&replica=True&type=Dataset&latest=True&project=CMIP5&variable=tas&cmor_table=Amon&model=FIO-ESM&experiment=historical&facets=%2A

Without facet counts (quick):

https://esgf-index1.ceda.ac.uk/esg-search/search?distrib=false&limit=0&format=application%2Fsolr%2Bjson&replica=True&type=Dataset&latest=True&project=CMIP5&variable=tas&cmor_table=Amon&model=FIO-ESM&experiment=historical
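For illustration, the only difference between the two requests is the `facets` parameter. A sketch of how the query parameters might be assembled with facet counting opt-in rather than on by default (the real client builds these internally; this just shows the shape):

```python
def build_search_params(constraints, with_facet_counts=False):
    """Build query parameters for an ESGF search request.

    Leaving out the 'facets' key avoids the expensive server-side
    facet aggregation shown in the slow URL above. Sketch only; not
    the client's actual internal function.
    """
    params = {
        "format": "application/solr+json",
        "limit": 0,
        "distrib": "false",
        "type": "Dataset",
    }
    params.update(constraints)
    if with_facet_counts:
        params["facets"] = "*"
    return params

quick = build_search_params({"project": "CMIP5", "variable": "tas"})
slow = build_search_params({"project": "CMIP5", "variable": "tas"},
                           with_facet_counts=True)
```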

Allow user-defined limits for search and wget script generation in SearchContext

Currently when I attempt to generate wget scripts using SearchContext.get_download_script(), setting limit has no effect. There was a comment in the module that said this was a planned feature.

Any reason why this hasn't been implemented yet? This would be useful since the default limit is too small for certain use cases. It seems like something that should be trivial at this point, so correct me if I am wrong.
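As a stopgap, the wget endpoint of the search service accepts the same constraints as search plus a `limit` parameter, so the script URL can be assembled directly instead of going through `get_download_script()` (a sketch; parameter handling here is simplified):

```python
from urllib.parse import urlencode

def wget_script_url(base_url, limit, **constraints):
    """Build a wget-script URL with an explicit result limit.

    Bypasses SearchContext.get_download_script() by hitting the
    search service's 'wget' endpoint directly. Sketch only.
    """
    query = dict(constraints, limit=limit)
    return base_url.rstrip("/") + "/wget?" + urlencode(query)

url = wget_script_url("https://esgf-node.llnl.gov/esg-search",
                      1000, project="CMIP5", variable="tas")
# The script itself would then be fetched with e.g. requests.get(url).
```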

esgf-pyclient review

  • Clone esgf-pyclient into a sandbox environment
  • Run the unit tests and check if any need fixing
  • Make updates to the requests-cache library interface as specified here: #71 (comment)
  • Test the above with different versions of the requests-cache library to ensure they work with the old and new interface
  • Update the changelog/history
  • Create PR and merge to master

Unexpected number of results for large query

I am exploring using esgf-pyclient to get a list of all retracted CMIP6 datasets (for our automated maintenance of Pangeo CMIP6 cloud data).

I am trying the following:

from pyesgf.search import SearchConnection
conn = SearchConnection(
    'https://esgf-node.llnl.gov/esg-search',
    distrib=True,
)
ctx = conn.new_context(mip_era='CMIP6', retracted=True, replica=False, fields='id', facets=['doi'])
ctx.hit_count

And I get back a hit count of 691984

But when I try to extract a list of instance_ids

results = ctx.search(batch_size=10000)
retracted = [ds.dataset_id for ds in results]
len(retracted)

The list only has 240000 elements. That very even number makes me think there is some internal limit I am hitting here.

Or did I miss something in the above code?

Any help on this would be greatly appreciated.
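If it is a server-side paging cap, a common workaround is to shard the query by an extra facet (e.g. activity_id) so each sub-query stays under the cap, then merge and deduplicate the ids. A sketch (the facet values below are illustrative):

```python
def shard_constraints(base, facet, values):
    """Split one big query into per-facet sub-queries."""
    return [dict(base, **{facet: v}) for v in values]

def merge_unique(id_batches):
    """Merge dataset-id lists from the sub-queries, deduplicating."""
    seen = set()
    for batch in id_batches:
        seen.update(batch)
    return sorted(seen)

# Hypothetical usage with the connection from the snippet above:
# batches = []
# for c in shard_constraints({'mip_era': 'CMIP6', 'retracted': True,
#                             'replica': False}, 'activity_id',
#                            ['CMIP', 'ScenarioMIP']):
#     ctx = conn.new_context(**c)
#     batches.append([ds.dataset_id for ds in ctx.search(batch_size=10000)])
# retracted = merge_unique(batches)
```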

logon for http request

Hi everyone,
I have one more thing where I am a little lost. I can access ESGF URLs via pyesgf and it works fine for me with OPeNDAP. However, I regularly access CORDEX datasets, which require an ESGF logon for data access. It works fine with OPeNDAP and xarray if I log on and search like, e.g.,

import xarray as xr
import pyesgf
from pyesgf.logon import LogonManager
from pyesgf.search import SearchConnection



lm = LogonManager()

# logon
myproxy_host = 'esgf-data.dkrz.de'
lm.logon(hostname=myproxy_host, interactive=True, bootstrap=True)
print(lm.is_logged_on())

# search
conn = SearchConnection('http://esgf-data.dkrz.de/esg-search', distrib=False)
ctx = conn.new_context(project='CORDEX', experiment='evaluation', time_frequency='mon',
                       variable='tas', driving_model="ECMWF-ERAINT", domain="EUR-11")
result = ctx.search()
print(f"length: {len(result)}")

res = result[0]
ctx = res.file_context()
#ctx.facet_counts
dataset = ctx.search()

download_url = dataset[0].download_url
opendap_url = dataset[0].opendap_url


ds = xr.open_dataset(opendap_url)
ds


However, I can't access the data via the download_url, e.g.,

import fsspec
with fsspec.open(download_url, ssl=True) as f:
    ds = xr.open_dataset(f)

which gives a 401 Unauthorized error:

---------------------------------------------------------------------------
ClientResponseError                       Traceback (most recent call last)
File /opt/anaconda3/envs/pyesgf/lib/python3.10/site-packages/fsspec/implementations/http.py:391, in HTTPFileSystem._info(self, url, **kwargs)
    389 try:
    390     info.update(
--> 391         await _file_info(
    392             url,
    393             size_policy=policy,
    394             session=session,
    395             **self.kwargs,
    396             **kwargs,
    397         )
    398     )
    399     if info.get("size") is not None:

File /opt/anaconda3/envs/pyesgf/lib/python3.10/site-packages/fsspec/implementations/http.py:772, in _file_info(url, session, size_policy, **kwargs)
    771 async with r:
--> 772     r.raise_for_status()
    774     # TODO:
    775     #  recognise lack of 'Accept-Ranges',
    776     #                 or 'Accept-Ranges': 'none' (not 'bytes')
    777     #  to mean streaming only, no random access => return None

File /opt/anaconda3/envs/pyesgf/lib/python3.10/site-packages/aiohttp/client_reqrep.py:1004, in ClientResponse.raise_for_status(self)
   1003 self.release()
-> 1004 raise ClientResponseError(
   1005     self.request_info,
   1006     self.history,
   1007     status=self.status,
   1008     message=self.reason,
   1009     headers=self.headers,
   1010 )

ClientResponseError: 401, message='401', url=URL('https://cordexesg.dmi.dk/esg-orp/home.htm?redirect=http://cordexesg.dmi.dk/thredds/fileServer/cordex_general/cordex/output/EUR-11/DMI/ECMWF-ERAINT/evaluation/r1i1p1/DMI-HIRHAM5/v1/mon/tas/v20140620/tas_EUR-11_ECMWF-ERAINT_evaluation_r1i1p1_DMI-HIRHAM5_v1_mon_198901-199012.nc')

The above exception was the direct cause of the following exception:

FileNotFoundError                         Traceback (most recent call last)
Input In [5], in <cell line: 2>()
      1 import fsspec
----> 2 with fsspec.open(download_url, ssl=True) as f:
      3     ds = xr.open_dataset(f)

File /opt/anaconda3/envs/pyesgf/lib/python3.10/site-packages/fsspec/core.py:104, in OpenFile.__enter__(self)
    101 def __enter__(self):
    102     mode = self.mode.replace("t", "").replace("b", "") + "b"
--> 104     f = self.fs.open(self.path, mode=mode)
    106     self.fobjects = [f]
    108     if self.compression is not None:

File /opt/anaconda3/envs/pyesgf/lib/python3.10/site-packages/fsspec/spec.py:1037, in AbstractFileSystem.open(self, path, mode, block_size, cache_options, compression, **kwargs)
   1035 else:
   1036     ac = kwargs.pop("autocommit", not self._intrans)
-> 1037     f = self._open(
   1038         path,
   1039         mode=mode,
   1040         block_size=block_size,
   1041         autocommit=ac,
   1042         cache_options=cache_options,
   1043         **kwargs,
   1044     )
   1045     if compression is not None:
   1046         from fsspec.compression import compr

File /opt/anaconda3/envs/pyesgf/lib/python3.10/site-packages/fsspec/implementations/http.py:340, in HTTPFileSystem._open(self, path, mode, block_size, autocommit, cache_type, cache_options, size, **kwargs)
    338 kw["asynchronous"] = self.asynchronous
    339 kw.update(kwargs)
--> 340 size = size or self.info(path, **kwargs)["size"]
    341 session = sync(self.loop, self.set_session)
    342 if block_size and size:

File /opt/anaconda3/envs/pyesgf/lib/python3.10/site-packages/fsspec/asyn.py:86, in sync_wrapper.<locals>.wrapper(*args, **kwargs)
     83 @functools.wraps(func)
     84 def wrapper(*args, **kwargs):
     85     self = obj or args[0]
---> 86     return sync(self.loop, func, *args, **kwargs)

File /opt/anaconda3/envs/pyesgf/lib/python3.10/site-packages/fsspec/asyn.py:66, in sync(loop, func, timeout, *args, **kwargs)
     64     raise FSTimeoutError from return_result
     65 elif isinstance(return_result, BaseException):
---> 66     raise return_result
     67 else:
     68     return return_result

File /opt/anaconda3/envs/pyesgf/lib/python3.10/site-packages/fsspec/asyn.py:26, in _runner(event, coro, result, timeout)
     24     coro = asyncio.wait_for(coro, timeout=timeout)
     25 try:
---> 26     result[0] = await coro
     27 except Exception as ex:
     28     result[0] = ex

File /opt/anaconda3/envs/pyesgf/lib/python3.10/site-packages/fsspec/implementations/http.py:404, in HTTPFileSystem._info(self, url, **kwargs)
    401     except Exception as exc:
    402         if policy == "get":
    403             # If get failed, then raise a FileNotFoundError
--> 404             raise FileNotFoundError(url) from exc
    405         logger.debug(str(exc))
    407 return {"name": url, "size": None, **info, "type": "file"}

FileNotFoundError: http://cordexesg.dmi.dk/thredds/fileServer/cordex_general/cordex/output/EUR-11/DMI/ECMWF-ERAINT/evaluation/r1i1p1/DMI-HIRHAM5/v1/mon/tas/v20140620/tas_EUR-11_ECMWF-ERAINT_evaluation_r1i1p1_DMI-HIRHAM5_v1_mon_198901-199012.nc

I would be grateful for any idea of how I can access CORDEX http URLs. If I simply click on those http URLs and log in (in the web portal), I can download the files from the browser. However, I have no experience with logging in with an OpenID in Python for http access...
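Not sure this solves it, but after a successful LogonManager logon a proxy certificate is typically written under ~/.esg/credentials.pem (the default location; check your setup). HTTP clients such as requests can present it as a client certificate instead of going through the OpenID redirect, along these lines:

```python
import os

# Default location where pyesgf's LogonManager usually writes the
# proxy certificate after a successful logon (verify on your system):
cert_path = os.path.join(os.path.expanduser("~"), ".esg", "credentials.pem")

# Hypothetical download using the certificate for authentication:
# import requests
# r = requests.get(download_url, cert=(cert_path, cert_path))
# r.raise_for_status()
# with open("out.nc", "wb") as f:
#     f.write(r.content)
```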

cordex data access

Hi,
the esgf-pyclient really works great for me with CMIP5 and CMIP6 data. However, I have some problems accessing CORDEX data. I have CORDEX_Research data access rights and can successfully log on using the pyclient:

import netCDF4 as nc4
from pyesgf.logon import LogonManager
from pyesgf.search import SearchConnection
import pyesgf

print(nc4.__version__)
print(pyesgf.__version__)

lm = LogonManager()

myproxy_host = 'esgf-data.dkrz.de'
lm.logon(hostname=myproxy_host, interactive=True, bootstrap=True)
lm.is_logged_on()
1.5.3
0.3.0
Enter myproxy username: 

 g300046
Enter password for g300046:  ········





True
# search CORDEX project for REMO2015 fx orog variables
conn = SearchConnection('http://esgf-data.dkrz.de/esg-search', distrib=False)
ctx = conn.new_context(project='CORDEX', experiment='evaluation', time_frequency='fx', rcm_name='REMO2015', variable='orog')
result = ctx.search()

orog_url = {}

# loop through search results of datasets
for res in result:
    ctx = res.file_context()
    domain = list(ctx.facet_counts['domain'].keys())[0]
    print('domain: {}'.format(domain))
    # the dataset should contain only one file for fx variables
    dataset = ctx.search()
    filename = dataset[0].opendap_url
    print('filename: {}'.format(filename))
    orog_url[domain] = filename
domain: EUR-11
filename: http://esgf1.dkrz.de/thredds/dodsC/cordex/cordex/output/EUR-11/GERICS/ECMWF-ERAINT/evaluation/r0i0p0/GERICS-REMO2015/v1/fx/orog/v20180813/orog_EUR-11_ECMWF-ERAINT_evaluation_r0i0p0_GERICS-REMO2015_v1_fx.nc
domain: SAM-22
filename: http://esgf1.dkrz.de/thredds/dodsC/cordex/cordex/output/SAM-22/GERICS/ECMWF-ERAINT/evaluation/r0i0p0/GERICS-REMO2015/v1/fx/orog/v20191030/orog_SAM-22_ECMWF-ERAINT_evaluation_r0i0p0_GERICS-REMO2015_v1_fx_r0i0p0.nc
domain: AFR-22
filename: http://esgf1.dkrz.de/thredds/dodsC/cordex/cordex/output/AFR-22/GERICS/ECMWF-ERAINT/evaluation/r0i0p0/GERICS-REMO2015/v1/fx/orog/v20191030/orog_AFR-22_ECMWF-ERAINT_evaluation_r0i0p0_GERICS-REMO2015_v1_fx_r0i0p0.nc
domain: CAM-22
filename: http://esgf1.dkrz.de/thredds/dodsC/cordex/cordex/output/CAM-22/GERICS/ECMWF-ERAINT/evaluation/r0i0p0/GERICS-REMO2015/v1/fx/orog/v20191030/orog_CAM-22_ECMWF-ERAINT_evaluation_r0i0p0_GERICS-REMO2015_v1_fx_r0i0p0.nc
domain: EAS-22
filename: http://esgf1.dkrz.de/thredds/dodsC/cordex/cordex/output/EAS-22/GERICS/ECMWF-ERAINT/evaluation/r0i0p0/GERICS-REMO2015/v1/fx/orog/v20191030/orog_EAS-22_ECMWF-ERAINT_evaluation_r0i0p0_GERICS-REMO2015_v1_fx_r0i0p0.nc
domain: EUR-22
filename: http://esgf1.dkrz.de/thredds/dodsC/cordex/cordex/output/EUR-22/GERICS/ECMWF-ERAINT/evaluation/r0i0p0/GERICS-REMO2015/v1/fx/orog/v20191030/orog_EUR-22_ECMWF-ERAINT_evaluation_r0i0p0_GERICS-REMO2015_v1_fx_r0i0p0.nc
domain: SEA-22
filename: http://esgf1.dkrz.de/thredds/dodsC/cordex/cordex/output/SEA-22/GERICS/ECMWF-ERAINT/evaluation/r0i0p0/GERICS-REMO2015/v1/fx/orog/v20191030/orog_SEA-22_ECMWF-ERAINT_evaluation_r0i0p0_GERICS-REMO2015_v1_fx_r0i0p0.nc
domain: WAS-22
filename: http://esgf1.dkrz.de/thredds/dodsC/cordex/cordex/output/WAS-22/GERICS/ECMWF-ERAINT/evaluation/r0i0p0/GERICS-REMO2015/v1/fx/orog/v20191030/orog_WAS-22_ECMWF-ERAINT_evaluation_r0i0p0_GERICS-REMO2015_v1_fx_r0i0p0.nc
domain: AUS-22
filename: http://esgf1.dkrz.de/thredds/dodsC/cordex/cordex/output/AUS-22/GERICS/ECMWF-ERAINT/evaluation/r0i0p0/GERICS-REMO2015/v1/fx/orog/v20191030/orog_AUS-22_ECMWF-ERAINT_evaluation_r0i0p0_GERICS-REMO2015_v1_fx_r0i0p0.nc
domain: CAS-22
filename: http://esgf1.dkrz.de/thredds/dodsC/cordex/cordex/output/CAS-22/GERICS/ECMWF-ERAINT/evaluation/r0i0p0/GERICS-REMO2015/v1/fx/orog/v20191030/orog_CAS-22_ECMWF-ERAINT_evaluation_r0i0p0_GERICS-REMO2015_v1_fx_r0i0p0.nc
orog_url.keys()
dict_keys(['EUR-11', 'SAM-22', 'AFR-22', 'CAM-22', 'EAS-22', 'EUR-22', 'SEA-22', 'WAS-22', 'AUS-22', 'CAS-22'])
url = orog_url['EUR-11']
url
'http://esgf1.dkrz.de/thredds/dodsC/cordex/cordex/output/EUR-11/GERICS/ECMWF-ERAINT/evaluation/r0i0p0/GERICS-REMO2015/v1/fx/orog/v20180813/orog_EUR-11_ECMWF-ERAINT_evaluation_r0i0p0_GERICS-REMO2015_v1_fx.nc'

This works all fine until I actually want to access the data:

# netcdf4 engine
ds = nc4.Dataset(url)
---------------------------------------------------------------------------

OSError                                   Traceback (most recent call last)

<ipython-input-8-fbb4748a9677> in <module>()
      1 # netcdf4 engine
----> 2 ds = nc4.Dataset(url)


netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Dataset.__init__()


netCDF4/_netCDF4.pyx in netCDF4._netCDF4._ensure_nc_success()


OSError: [Errno -68] NetCDF: I/O failure: b'http://esgf1.dkrz.de/thredds/dodsC/cordex/cordex/output/EUR-11/GERICS/ECMWF-ERAINT/evaluation/r0i0p0/GERICS-REMO2015/v1/fx/orog/v20180813/orog_EUR-11_ECMWF-ERAINT_evaluation_r0i0p0_GERICS-REMO2015_v1_fx.nc'

With CMIP5 data everything works fine, e.g,:

# check with CMIP5 data, this works fine.
url = "http://esgf1.dkrz.de/thredds/dodsC/cmip5/cmip5/output1/MPI-M/MPI-ESM-LR/historical/fx/atmos/fx/r0i0p0/v20120315/orog/orog_fx_MPI-ESM-LR_historical_r0i0p0.nc"
ds = nc4.Dataset(url)
ds
<class 'netCDF4._netCDF4.Dataset'>
root group (NETCDF3_CLASSIC data model, file format DAP2):
    institution: Max Planck Institute for Meteorology
    institute_id: MPI-M
    experiment_id: historical
    source: MPI-ESM-LR 2011; URL: http://svn.zmaw.de/svn/cosmos/branches/releases/mpi-esm-cmip5/src/mod; atmosphere: ECHAM6 (REV: 4603), T63L47; land: JSBACH (REV: 4603); ocean: MPIOM (REV: 4603), GR15L40; sea ice: 4603; marine bgc: HAMOCC (REV: 4603);
    model_id: MPI-ESM-LR
    forcing: GHG,Oz,SD,Sl,Vl,LU
    parent_experiment_id: piControl
    parent_experiment_rip: r1i1p1
    branch_time: 10957.0
    contact: [email protected]
    history: Model raw output postprocessing with modelling environment (IMDI) at DKRZ: URL: http://svn-mad.zmaw.de/svn/mad/Model/IMDI/trunk, REV: 4201 2012-01-13T07:51:03Z CMOR rewrote data to comply with CF standards and CMIP5 requirements.
    references: ECHAM6: n/a; JSBACH: Raddatz et al., 2007. Will the tropical land biosphere dominate the climate-carbon cycle feedback during the twenty first century? Climate Dynamics, 29, 565-574, doi 10.1007/s00382-007-0247-8;  MPIOM: Marsland et al., 2003. The Max-Planck-Institute global ocean/sea ice model with orthogonal curvilinear coordinates. Ocean Modelling, 5, 91-127;  HAMOCC: Technical Documentation, http://www.mpimet.mpg.de/fileadmin/models/MPIOM/HAMOCC5.1_TECHNICAL_REPORT.pdf;
    initialization_method: 0
    physics_version: 0
    tracking_id: d9bbcbd4-c852-4bd0-a3b4-0fccb598f23c
    product: output
    experiment: historical
    frequency: fx
    creation_date: 2012-01-13T07:51:03Z
    Conventions: CF-1.4
    project_id: CMIP5
    table_id: Table fx (26 July 2011) 491518982c8d8b607a58ba740689ea09
    title: MPI-ESM-LR model output prepared for CMIP5 historical
    parent_experiment: pre-industrial control
    modeling_realm: atmos
    realization: 0
    cmor_version: 2.6.0
    dimensions(sizes): bnds(2), lat(96), lon(192)
    variables(dimensions): float64 lat(lat), float64 lat_bnds(lat,bnds), float64 lon(lon), float64 lon_bnds(lon,bnds), float32 orog(lat,lon)
    groups: 

I know, that this is no esgf-pyclient issue but I wonder how the logon would work. I suspect it's a problem with me logging onto ESGF via python (I can logon also on the web interface of ESGF and download CORDEX data without a probem). It would be really nice for me to have access to the opendap urls via python, too. Thanks a lot!

Compatibility of LogonManager with OpenSSL 1.1.1e

The following issue occurs with OpenSSL 1.1.1e but goes away if I downgrade to 1.1.1d. It seems that other users of OpenSSL are reporting similar issues (e.g. openssl/openssl#11381).

As an interim measure, I suggest specifying a dependency on OpenSSL=1.1.1d.

from pyesgf.logon import LogonManager

lm = LogonManager()

# Error trace: 
      4 openid = "MY_OPENID"
      5 password = "MY_PASSWORD"
----> 6 lm.logon_with_openid(openid=openid, password=password, bootstrap=True)
      7 lm.is_logged_on()

~/.conda/envs/research/lib/python3.8/site-packages/pyesgf/logon.py in logon_with_openid(self, openid, password, bootstrap, update_trustroots, interactive)
    144         """
    145         username, myproxy = self._get_logon_details(openid)
--> 146         return self.logon(username, password, myproxy,
    147                           bootstrap=bootstrap,
    148                           update_trustroots=update_trustroots,

~/.conda/envs/research/lib/python3.8/site-packages/pyesgf/logon.py in logon(self, username, password, hostname, bootstrap, update_trustroots, interactive)
    181         c = MyProxyClient(hostname=hostname, caCertDir=self.esgf_certs_dir)
    182 
--> 183         creds = c.logon(username, password,
    184                         bootstrap=bootstrap,
    185                         updateTrustRoots=update_trustroots)

~/.conda/envs/research/lib/python3.8/site-packages/myproxy/client/__init__.py in logon(self, username, passphrase, credname, lifetime, keyPair, certReq, nBitsForKey, bootstrap, updateTrustRoots, authnGetTrustRootsCall, sslCertFile, sslKeyFile, sslKeyFilePassphrase)
   1451                 getTrustRootsKw = {}
   1452 
-> 1453             self.getTrustRoots(writeToCACertDir=True,
   1454                                bootstrap=bootstrap,
   1455                                **getTrustRootsKw)

~/.conda/envs/research/lib/python3.8/site-packages/myproxy/client/__init__.py in getTrustRoots(self, username, passphrase, writeToCACertDir, bootstrap)
   1622         try:
   1623             for tries in range(self.MAX_RECV_TRIES):
-> 1624                 dat += conn.recv(self.SERVER_RESP_BLK_SIZE)
   1625         except SSL.SysCallError:
   1626             # Expect this exception when response content exhausted

~/.conda/envs/research/lib/python3.8/site-packages/OpenSSL/SSL.py in recv(self, bufsiz, flags)
   1807         else:
   1808             result = _lib.SSL_read(self._ssl, buf, bufsiz)
-> 1809         self._raise_ssl_error(self._ssl, result)
   1810         return _ffi.buffer(buf, result)[:]
   1811     read = recv

~/.conda/envs/research/lib/python3.8/site-packages/OpenSSL/SSL.py in _raise_ssl_error(self, ssl, result)
   1669             pass
   1670         else:
-> 1671             _raise_current_error()
   1672 
   1673     def get_context(self):

~/.conda/envs/research/lib/python3.8/site-packages/OpenSSL/_util.py in exception_from_error_queue(exception_type)
     52             text(lib.ERR_reason_error_string(error))))
     53 
---> 54     raise exception_type(errors)
     55 
     56 

Error: [('SSL routines', 'ssl3_read_n', 'unexpected eof while reading')]

Accessing opendap datasets

I am working on what I think is a fairly common workflow:

  1. log on to ESGF using the LogonManager class
  2. search for some datasets using the SearchConnection class
  3. access some opendap dataset using netcdf4-python or pydap

Here's an example workflow:

In [1]: openid = 'https://esgf-node.llnl.gov/esgf-idp/openid/SECRET'
   ...: password = 'SECRET'
   ...:

In [2]: from pyesgf.logon import LogonManager
   ...: from pyesgf.search import SearchConnection
   ...: import xarray as xr
   ...:

In [3]: # intialize the logon manager
   ...: lm = LogonManager(verify=True)
   ...: if not lm.is_logged_on():
   ...:     lm.logon_with_openid(openid, password, 'pcmdi9.llnl.gov')
   ...: lm.is_logged_on()
   ...:
Out[3]: True

In [4]: def print_context_info(ctx):
   ...:     print('Hits:', ctx.hit_count)
   ...:     print('Realms:', ctx.facet_counts['experiment'])
   ...:     print('Realms:', ctx.facet_counts['realm'])
   ...:     print('Ensembles:', ctx.facet_counts['ensemble'])
   ...:

In [5]: # search for some data
   ...: conn = SearchConnection('http://pcmdi9.llnl.gov/esg-search', distrib=Tru
   ...: e)
   ...: ctx = conn.new_context(project='CMIP5', model='CCSM4', experiment='rcp85
   ...: ', time_frequency='day')
   ...: ctx = ctx.constrain(realm='atmos', ensemble='r1i1p1')
   ...:
   ...: # print a summary of what we found
   ...: print_context_info(ctx)
   ...:
Hits: 4
Realms: {'rcp85': 4}
Realms: {'atmos': 4}
Ensembles: {'r1i1p1': 4}

In [6]: # aggregate results
   ...: result = ctx.search()[0]
   ...: agg_ctx = result.aggregation_context()
   ...:
   ...: # get a list of opendap urls
   ...: x = list(a.opendap_url for a in agg_ctx.search() if a.opendap_url)
   ...: x
   ...:
Out[6]:
['http://aims3.llnl.gov/thredds/dodsC/cmip5.output1.NCAR.CCSM4.rcp85.day.atmos.day.r1i1p1.tasmin.20120705.aggregation.1',
 'http://aims3.llnl.gov/thredds/dodsC/cmip5.output1.NCAR.CCSM4.rcp85.day.atmos.day.r1i1p1.tasmax.20120705.aggregation.1',
 'http://aims3.llnl.gov/thredds/dodsC/cmip5.output1.NCAR.CCSM4.rcp85.day.atmos.day.r1i1p1.prc.20120705.aggregation.1',
 'http://aims3.llnl.gov/thredds/dodsC/cmip5.output1.NCAR.CCSM4.rcp85.day.atmos.day.r1i1p1.psl.20120705.aggregation.1',
 'http://aims3.llnl.gov/thredds/dodsC/cmip5.output1.NCAR.CCSM4.rcp85.day.atmos.day.r1i1p1.tas.20120705.aggregation.1',
 'http://aims3.llnl.gov/thredds/dodsC/cmip5.output1.NCAR.CCSM4.rcp85.day.atmos.day.r1i1p1.pr.20120705.aggregation.1']

In [7]: # try opening one of the opendap datasets
   ...: xr.open_dataset(x[0], engine='pydap')
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-7-90d39efb83f7> in <module>()
      1 # try opening one of the opendap datasets
----> 2 xr.open_dataset(x[0], engine='pydap')

~/anaconda/envs/aist/lib/python3.6/site-packages/xarray/backends/api.py in open_dataset(filename_or_obj, group, decode_cf, mask_and_scale, decode_times, autoclose, concat_characters, decode_coords, engine, chunks, lock, cache, drop_variables)
    302                                             autoclose=autoclose)
    303         elif engine == 'pydap':
--> 304             store = backends.PydapDataStore.open(filename_or_obj)
    305         elif engine == 'h5netcdf':
    306             store = backends.H5NetCDFStore(filename_or_obj, group=group,

~/anaconda/envs/aist/lib/python3.6/site-packages/xarray/backends/pydap_.py in open(cls, url, session)
     75     def open(cls, url, session=None):
     76         import pydap.client
---> 77         ds = pydap.client.open_url(url, session=session)
     78         return cls(ds)
     79

~/anaconda/envs/aist/lib/python3.6/site-packages/pydap/client.py in open_url(url, application, session, output_grid)
     62     never retrieve coordinate axes.
     63     """
---> 64     dataset = DAPHandler(url, application, session, output_grid).dataset
     65
     66     # attach server-side functions

~/anaconda/envs/aist/lib/python3.6/site-packages/pydap/handlers/dap.py in __init__(self, url, application, session, output_grid)
     62
     63         # build the dataset from the DDS and add attributes from the DAS
---> 64         self.dataset = build_dataset(dds)
     65         add_attributes(self.dataset, parse_das(das))
     66

~/anaconda/envs/aist/lib/python3.6/site-packages/pydap/parsers/dds.py in build_dataset(dds)
    159 def build_dataset(dds):
    160     """Return a dataset object from a DDS representation."""
--> 161     return DDSParser(dds).parse()
    162
    163

~/anaconda/envs/aist/lib/python3.6/site-packages/pydap/parsers/dds.py in parse(self)
     47         dataset = DatasetType('nameless')
     48
---> 49         self.consume('dataset')
     50         self.consume('{')
     51         while not self.peek('}'):

~/anaconda/envs/aist/lib/python3.6/site-packages/pydap/parsers/dds.py in consume(self, regexp)
     39     def consume(self, regexp):
     40         """Consume and return a token."""
---> 41         token = super(DDSParser, self).consume(regexp)
     42         self.buffer = self.buffer.lstrip()
     43         return token

~/anaconda/envs/aist/lib/python3.6/site-packages/pydap/parsers/__init__.py in consume(self, regexp)
    180             self.buffer = self.buffer[len(token):]
    181         else:
--> 182             raise Exception("Unable to parse token: %s" % self.buffer[:10])
    183         return token

Exception: Unable to parse token:

Questions:

  1. Is this actually a workflow that should work?
  2. Does this opendap URL actually exist? What is the best way to test that an opendap url from esgf is a valid one?
  3. Is additional authentication required?
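Regarding question 2: a lightweight way to probe whether an OpenDAP URL is live is to fetch its DDS document, since a valid endpoint answers with text starting with "Dataset {". A sketch:

```python
def dds_url(opendap_url):
    """URL of the DDS document; fetching it is a cheap validity probe."""
    return opendap_url + ".dds"

# Hypothetical check with the 'x' list from the session above:
# import requests
# r = requests.get(dds_url(x[0]))
# is_valid = r.status_code == 200 and r.text.lstrip().startswith("Dataset")
```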

Python3 compatibility in wget scripts

Hi all,

Thanks for maintaining this library; it's really useful in my day-to-day work. I wanted to raise an issue since I don't think Python 3 is 100% compatible yet. When generating a wget script, I can only run it from my shell if my native Python is 2.7. This is a bit of a pain when working in conda and needing to create a Python 2.7 environment to run scripts written using Python 3.

The error raised when running the generated wget scripts with Python3 is as follows:

File "<stdin>", line 18
    print "-s %s -p %s -l %s" % (host, port, username)
                            ^
SyntaxError: invalid syntax

Is this a known issue? Are there plans to address this issue? Is this user error? Please let me know.
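For reference, the failing line is Python 2-only print-statement syntax; adding parentheses makes it valid under both interpreters (host, port, username below are placeholder values):

```python
host, port, username = "myproxy.example.org", 7512, "jdoe"  # placeholders

# Python 2-only form, as found in the generated script:
#   print "-s %s -p %s -l %s" % (host, port, username)

# Form that works on both Python 2 and 3:
print("-s %s -p %s -l %s" % (host, port, username))
# prints: -s myproxy.example.org -p 7512 -l jdoe
```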

Cheers,

CMIP6 data availability?

Hi, the documentation indicates that this package should work with CMIP6. However, when I attempt the following:

conn = SearchConnection('https://esgf-node.llnl.gov/esg-search')
ctx = conn.new_context(project='CMIP6', experiment='past1000', variable='tas')
print('Hits: {}, Realms: {}, Ensembles: {}'.format(
    ctx.hit_count,
    ctx.facet_counts['realm'],
    ctx.facet_counts['ensemble']))
print(ctx.get_facet_options())

I get different results than searching through the web GUI at https://esgf-node.llnl.gov/search/cmip6/. The CMIP6 data guide (https://pcmdi.llnl.gov/CMIP6/Guide/dataUsers.html) directs me back to the RESTful API (https://esgf.github.io/esgf-user-support/user_guide.html#the-esgf-search-restful-api) which is providing incomplete results.

Does anyone know the source of this issue, or what alternatives exist for bulk downloads of filtered CMIP6 data? The wget script generator has been inconsistent and I've found myself missing datasets. Downloading search results as JSON is limited to 100 results at a time. Is this potentially due to the THREDDS catalog being down? The CEDA node alternatively notes CMIP6 data is still in the process of being added to its FTP catalog, so that is out too.

Sorry if I am missing something obvious. If anyone has leads it would be much appreciated!

logon does not allow access to all ESGF nodes

I have a simple example: I log on to ESGF with my OpenID:

from pyesgf.logon import LogonManager

lm = LogonManager()
OPENID = 'https://esgf-data.dkrz.de/esgf-idp/openid/<user>'
lm.logon_with_openid(openid=OPENID, interactive=True, bootstrap=True)
lm.is_logged_on()

That works fine; however, I still get a 401 if I try to access a dataset on esgf.dwd.de:

import xarray as xr
xr.open_dataset("https://esgf.dwd.de/thredds/dodsC/esgf2_1/cordex/output/EUR-11/CLMcom/MIROC-MIROC5/rcp26/r1i1p1/CLMcom-CCLM4-8-17/v1/mon/tas/v20180707/tas_EUR-11_MIROC-MIROC5_rcp26_r1i1p1_CLMcom-CCLM4-8-17_v1_mon_200601-201012.nc", engine="pydap")
HTTPError: 401 401

Shouldn't my OpenID grant general access to all ESGF servers? I also have problems accessing other servers; only esgf-data.dkrz.de seems to be stable. If I go to the web interface, log on, and open that URL, it works fine. It also fails with the netcdf4 engine...
