Comments (4)
This appears to be a problem with the ESGF search results: if you specify facets=*, you appear to get only search results from the local node, while if you specify the facet you're interested in (facets=project), you get more complete results?
>>> import pyesgf.search
>>> pyesgf.search.SearchConnection(url='https://esgf-data.dkrz.de/esg-search', distrib=True).new_context().facet_counts['project']
{'wind': 1, 'uerra': 2, 'tracmip': 6767, 'reklies-index': 28792, 'obs4MIPs': 2, 'monthlyfc': 2710, 'input4mips': 5832, 'hiresireland': 66, 'TEST': 4, 'TAMIP': 192, 'PMIP3': 16, 'MiKlip': 5568, 'MPI-GE': 55111, 'LUCID': 112, 'CORDEX-Reklies': 7017, 'CORDEX-ESD': 1370, 'CORDEX': 67908, 'CMIP6': 882140, 'CMIP5': 53725}
>>> pyesgf.search.SearchConnection(url='https://esgf-data.dkrz.de/esg-search', distrib=True).new_context(facets=['project']).facet_counts['project']
{'wind': 1, 'uerra': 2, 'tracmip': 6767, 'specs': 446693, 'reklies-index': 28792, 'psipps': 1, 'primavera': 6947, 'obs4MIPs': 210, 'monthlyfc': 2710, 'input4mips': 11492, 'input4MIPs': 201, 'hiresireland': 66, 'eucleia': 1921, 'e3sm-supplement': 53, 'e3sm': 815, 'cmip3': 71, 'clipc': 114, 'cc4e': 497, 'c3se': 184, 'c3s-cmip5-adjust': 188, 'ana4MIPs': 14, 'TEST': 5, 'TAMIP': 1536, 'PMIP3': 361, 'NEXGDDP': 3, 'NEX': 10, 'NARR_Hydrology': 85, 'MiKlip': 5568, 'MPI-GE': 55111, 'LUCID': 318, 'ISIMIP3b': 550, 'ISIMIP3a': 111, 'ISIMIP2b': 95963, 'ISIMIP2a': 13803, 'ISIMIP2 Phase a': 288, 'ISI-MIP Fast Track': 856, 'GeoMIP': 757, 'EUCLIPSE': 41, 'CREATE-IP': 114, 'CORDEX-Reklies': 7017, 'CORDEX-ESD': 1370, 'CORDEX-Adjust': 1221, 'CORDEX': 183980, 'CMIP6': 11249567, 'CMIP5': 201129, 'CMIP3': 29331, 'CDAT-sample': 1, 'BioClim': 2, 'ACME': 23}
It might be nice to add some information about this behavior in the documentation, because this is not very intuitive.
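To make the difference between the two outputs easier to see, here is a minimal sketch that compares two facet-count dictionaries. The helper name compare_facet_counts is hypothetical (it is not part of esgf-pyclient), and the sample values are a few entries copied from the outputs above.

```python
def compare_facet_counts(wildcard_counts, explicit_counts):
    """Compare facet counts from a facets=* query with an explicit-facet query.

    Returns (missing, undercounted): the facet values absent from the
    wildcard result, and the values whose wildcard count is lower than
    the explicit one, as (wildcard, explicit) pairs.
    """
    missing = sorted(set(explicit_counts) - set(wildcard_counts))
    undercounted = {
        name: (wildcard_counts[name], explicit_counts[name])
        for name in wildcard_counts
        if name in explicit_counts
        and wildcard_counts[name] < explicit_counts[name]
    }
    return missing, undercounted


# A few entries copied from the two outputs above (not the full dictionaries).
wildcard = {"CMIP6": 882140, "CMIP5": 53725, "MiKlip": 5568}
explicit = {"CMIP6": 11249567, "CMIP5": 201129, "MiKlip": 5568, "specs": 446693}

missing, undercounted = compare_facet_counts(wildcard, explicit)
print(missing)       # projects that only show up with an explicit facet
print(undercounted)  # projects present in both, but with fewer hits for facets=*
```

Projects hosted on the queried node itself (MiKlip in this sample) have identical counts in both results, which is consistent with only local results being returned.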
from esgf-pyclient.
This is essentially a server-side issue. Since CMIP6, the number of facet values has increased substantially, because a few of the facets can take a very large number of possible values (sorry, I forget exactly which ones). If you include facets=* in the search, the server does not seem to do a proper distributed search across all the shards, with the consequence that results are missing. There was a workaround implemented on one or two of the index nodes, including DKRZ, to exclude certain known problematic facets from facets=*, but this seems to be no longer the case. Anyway, the point here is that it has nothing to do with pyclient, and exactly the same behaviour is observed using, for example, curl.
$ url_stem='https://esgf-data.dkrz.de/esg-search/search/?limit=0&type=Dataset&project=CMIP6'
$ curl -s "$url_stem&facets=project%2Cproduct%2Cdata_node" | grep numFound
<result name="response" numFound="11307457" start="0" maxScore="1.0">
$ curl -s "$url_stem&facets=*" | grep numFound
<result name="response" numFound="886069" start="0" maxScore="1.0">
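The same check can be sketched in Python using only the standard library. This is an illustration under assumptions, not part of esgf-pyclient: the helper names build_query, num_found and fetch_num_found are hypothetical, and the endpoint and query parameters are the ones used in the curl commands above.

```python
import re
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the curl example above.
SEARCH_URL = "https://esgf-data.dkrz.de/esg-search/search/"


def build_query(facets, **constraints):
    """Build a search URL that requests the given facets with limit=0."""
    params = {"limit": 0, "type": "Dataset", **constraints,
              "facets": ",".join(facets)}
    return SEARCH_URL + "?" + urlencode(params)


def num_found(xml_text):
    """Extract the numFound attribute from the XML response text."""
    match = re.search(r'numFound="(\d+)"', xml_text)
    if match is None:
        raise ValueError("no numFound attribute in response")
    return int(match.group(1))


def fetch_num_found(facets, **constraints):
    """Query the index node (needs network access) and return numFound."""
    with urlopen(build_query(facets, **constraints)) as response:
        return num_found(response.read().decode("utf-8"))
```

Calling fetch_num_found(["project", "product", "data_node"], project="CMIP6") and fetch_num_found(["*"], project="CMIP6") should correspond to the two curl requests above, although the numbers returned will depend on the state of the index at the time.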
Now the behaviour in esgf-pyclient is to default to facets=*, and the problem is that changing this would break backward compatibility. But we can issue a warning in the code.
I have pushed a feature branch with a commit to implement such a warning, and will open a pull request cross-referenced to this issue, but I do not want to merge to master myself. Somebody needs to review this and decide whether it is desirable.
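For illustration only, the general idea of warning on the default can be sketched as below. This is not the code on the feature branch: the function name resolve_facets and the message text are hypothetical, and the sketch only assumes that the client falls back to facets=* when no facets are given.

```python
import warnings

# Hypothetical message text, not the wording used in esgf-pyclient.
FACETS_STAR_MESSAGE = (
    "No facets were specified, so facets='*' will be used. On a distributed "
    "search this may return incomplete results; consider passing an explicit "
    "list of facets instead."
)


def resolve_facets(facets=None):
    """Return the facets to send, warning when falling back to '*'.

    The default is left unchanged to keep backward compatibility; the
    caller is only told that the results may be incomplete.
    """
    if facets is None:
        # stacklevel=2 attributes the warning to the caller's line.
        warnings.warn(FACETS_STAR_MESSAGE, UserWarning, stacklevel=2)
        return "*"
    return facets
```

With this shape, existing code that relies on the default keeps working, and code that passes facets explicitly never sees the warning.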
from esgf-pyclient.
Thanks for looking into this! If the warning in the code is not desirable, a warning in the documentation would already be great.
from esgf-pyclient.
Branch add_facets_star_warnings described above now also contains docstring changes, which will get incorporated in the Sphinx documentation.
Hoping this can be merged soon, after I have sorted out a couple of CI issues.
from esgf-pyclient.