python_cmr's Issues

how to identify the different types of data links from a GranuleQuery

How can I get a list of data, S3, or OPeNDAP URLs from a GranuleQuery?

I currently do the following to get the different types of URLs:

First, get the list of all entry URL links returned by granule metadata

all_links = [entry['links'] for entry in gran_met]

Next, get just the granule URL links, excluding inherited links (i.e., links inherited from the collection metadata).

gran_links = [link for group in all_links for link in group if 'inherited' not in link.keys()]

From there I can search for a string in the 'title' or 'rel' fields to identify the data, S3, or OPeNDAP URLs. Is there another way to get a list of the different types of URLs?
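For example, bucketing the filtered links by the suffix of their 'rel' URI (data# and s3# are the ESIP federated-search conventions I see in the metadata) and falling back to the 'title'/'href' for OPeNDAP looks something like this, though I'm not sure how robust it is:

# Group links from gran_links (above) by the suffix of their 'rel' URI.
data_urls = [link["href"] for link in gran_links
             if link.get("rel", "").endswith("/data#")]
s3_urls = [link["href"] for link in gran_links
           if link.get("rel", "").endswith("/s3#")]
opendap_urls = [link["href"] for link in gran_links
                if "opendap" in link.get("title", "").lower()
                or "opendap" in link.get("href", "").lower()]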

Support searching by multiple platforms

Currently, only a single platform may be supplied via the platform method. However, CMR supports searching for multiple platforms, so the platform method should be updated to accept either a single string or multiple strings, for both collections and granules (although the granule documentation does not describe this explicitly).
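A rough sketch of what the updated method could look like (a proposal, not the current API; it assumes the Query class keeps its CMR parameters in a self.params dict, and whether the parameter key should be platform or platform[] should follow the CMR documentation):

from typing import Sequence, Union

def platform(self, platform: Union[str, Sequence[str]]) -> "Query":
    # Accept a single platform name or a sequence of names.
    platforms = [platform] if isinstance(platform, str) else list(platform)
    if not platforms or not all(isinstance(p, str) and p for p in platforms):
        raise ValueError("platform must be a non-empty string or a sequence of non-empty strings")
    self.params["platform[]"] = platforms
    return self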

Add type annotations

Arguably, type annotations are extremely helpful and have become quite common, either directly within Python libraries, or within type stub packages (such as those available on typeshed).

Since typing was introduced in Python 3.5 (IIRC), adding type annotations should likely be done after #35 is implemented, so that it is clear that this library supports only versions of Python no older than 3.5 (and likely no older than 3.8, as even 3.8 is close to "retirement").

The motivation here is not only to help "modernize" this library, but also that the lack of type annotations has led to type stubs for it being defined within the earthaccess library, which depends on it. See nsidc/earthaccess#508. I would like to use those type stubs as a basis for implementing a solution to this issue.
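For illustration, annotated signatures for a couple of existing methods might look roughly like this (the precise types are a guess on my part, to be reconciled with the earthaccess stubs):

from typing import Any, List

class GranuleQuery:
    def short_name(self, short_name: str) -> "GranuleQuery": ...
    def get(self, limit: int = 2000) -> List[Any]: ...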

Discussion on release 0.12.0

We have a few unreleased changes ready.

There is 1 open PR waiting for minor changes to be applied.

Once the PR is ready I think we can release the next version. Based on the changes described here I believe the next version can be a minor increment to 0.12.0.

Feature: CMR Concept ID <-> Human Readable Name translation tool

Hi python-cmr folks,

I'm a developer at GES DISC, and during a hackweek last week my colleague @eni-awowale created a CLI tool (tentatively named CNR: CMR Name Resolver) that translates concept_ids to native_ids (and vice versa). It's not publicly shareable at this point, but if you have access to Earthdata Bitbucket you can see our repo here: https://git.earthdata.nasa.gov/projects/GDDS/repos/cnr/browse

After discussion with our manager @briannapagan, we were wondering whether this would be something you'd be interested in merging into python-cmr. Before opening any PRs, I wanted to start a discussion to make sure 1) this is something you are interested in having in this project, and 2) we're on the same page about how to implement it.

The tool is a Python script that gets installed as a CLI tool in the ./bin directory of your Python install via [project.scripts] in pyproject.toml. Example usage:

$ cnr M2T1NXSLV_5.12.4 
prod    C1276812863-GES_DISC
uat    C1245662776-EEDTEST
uat    C1215802944-GES_DISC

$ cnr -e prod M2T1NXSLV_5.12.4 
prod    C1276812863-GES_DISC

$ cnr C1215664070-GES_DISC           
uat	GLDAS_NOAH10_3H_2.0

$ cnr -p MERRA2_100.inst1_2d_asm_Nx.19800101.nc4 -p GES_DISC
prod	G1276974976-GES_DISC
uat	G1256129374-GES_DISC
sit	None

The full list of acceptable arguments:

$ cnr -h
usage: cnr [-h] [-e {prod,sit,uat}] [-p PROVIDER] [-c] [-g] [-s] id

Get a concept-id or native-id given either

positional arguments:
  id                    Collection ID or Native ID to convert

options:
  -h, --help            show this help message and exit
  -e {prod,sit,uat}, --environment {prod,sit,uat}
                        Environment to use
  -p PROVIDER, --provider PROVIDER
                        Used for a granule and collection query in CMR with a provider
                        id
  -c, --collection      Used to denote a collection query in CMR
  -g, --granule         Used to denote a granule query in CMR
  -s, --service         Used to denote a service query in CMR

Add method for setting parameter options

Parameter options can currently be set directly on a query instance, like so:

query.options[parameter] = {option_name: True}  # or False

However, this is undocumented, unintuitive, and error-prone.

Given that there are only specific parameter options available, I suggest providing a method (on the Query class) per option. For example, for the ignore_case parameter option:

def option_ignore_case(self, parameter: str, value: bool = True) -> Self:
    self.options[parameter] = {"ignore_case": value}
    return self

This would allow calls like query.option_ignore_case("entry_title") to ignore case when searching by entry title, for example.

Implement methods option_pattern, option_and, and option_or similarly.
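For concreteness, the remaining methods could mirror the same shape (again, a sketch of the proposal, living on the Query class alongside option_ignore_case; as above, Self would come from typing on Python 3.11+ or from typing_extensions on older versions):

def option_pattern(self, parameter: str, value: bool = True) -> Self:
    self.options[parameter] = {"pattern": value}
    return self

def option_and(self, parameter: str, value: bool = True) -> Self:
    self.options[parameter] = {"and": value}
    return self

def option_or(self, parameter: str, value: bool = True) -> Self:
    self.options[parameter] = {"or": value}
    return self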

Add vcrpy cassettes for revision_date unit tests

New unit tests recently added for the new revision_date search functionality are missing vcrpy cassettes, so they make live requests. This can lead to broken tests in the future, should the responses change.

Specifically, these tests need to be supported by vcrpy cassettes (a sketch of the setup follows the list):

  • test_revision_date in tests/test_collection.py
  • test_revision_date in tests/test_granule.py
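A sketch of what the cassette setup could look like (the cassette directory, file name, and the revision_date signature shown here are assumptions to be adapted to the repo's conventions):

import vcr
from cmr import GranuleQuery

my_vcr = vcr.VCR(
    cassette_library_dir="tests/fixtures/vcr_cassettes",  # assumed location
    record_mode="once",  # record on the first run, replay afterwards
)

@my_vcr.use_cassette("granule_revision_date.yaml")
def test_revision_date():
    query = GranuleQuery().short_name("MOD02QKM").revision_date("2023-01-01", "2023-12-31")
    assert query.hits() > 0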

Return query results as an iterator

To give users better control over memory consumption, particularly when executing queries that produce large search results (hundreds of thousands, or even millions of granules), query results should not be obtained in their entirety and put into a list all at once before returning the results to the user. This can easily lead to "out of memory" errors for large results. Instead, an iterator (generator) should be returned so the user has the option to limit the number of results held in memory at any given moment.

Since this would be a breaking change to the existing get and get_all query methods, it might be worth defining a new method. However, given that this library has not reached a 1.0 release, making such a breaking change would not be unreasonable. Alternatively, a new method could be added while deprecating the existing ones. Either way, I recommend that returning an iterator be the only option, not an additional one; otherwise it can confuse users as to why there are multiple options. This also follows the Zen of Python's "There should be one-- and preferably only one --obvious way to do it." Further, returning an iterator does not prevent the user from populating a list themselves, if they wish.

There has been discussion amongst the users and maintainers of the earthaccess library about this very topic, and a number of us agreed that it makes more sense to implement this here rather than in earthaccess, which simply calls through to this library.
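To make the idea concrete, a minimal sketch of a generator-based method (assuming the JSON response format and CMR's CMR-Search-After paging header; _build_url and headers reflect my understanding of the existing Query internals, so treat this as illustrative rather than a drop-in implementation):

from typing import Any, Dict, Iterator

import requests

def results(self, page_size: int = 2000) -> Iterator[Dict[str, Any]]:
    # Yield entries one page at a time instead of accumulating them all in a list.
    headers = dict(self.headers or {})
    url = self._build_url()
    while True:
        response = requests.get(url, headers=headers, params={"page_size": page_size})
        response.raise_for_status()
        entries = response.json()["feed"]["entry"]
        if not entries:
            return
        yield from entries
        search_after = response.headers.get("CMR-Search-After")
        if search_after is None:
            return
        headers["CMR-Search-After"] = search_after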

Support searching for available zarr stores

At GES DISC we are getting ready to expose our production public zarr stores. Lots of open questions remain about how to make these zarr stores easily searchable, especially because we publish zarr stores at the variable level, not the collection level.

In umm_json we specify zarr in the instance_information field; example here:
https://cmr.earthdata.nasa.gov/search/variables.umm_json?instance-format=zarr&provider=GES_DISC

I don't think python_cmr has brought over much of the CMR API functionality for VariableQuery so far, and I would like to add a piece of code to make searching for variable zarr stores more intuitive for the user:

import re
from cmr import VariableQuery

def get_all_zarr_stores(provider=None):
    # Return every variable record that carries instance_information (i.e. a zarr store).
    api = VariableQuery()
    if provider:
        all_vars = api.provider(provider).get_all()
    else:
        all_vars = api.get_all()
    zarrs = []
    for variable_entry in all_vars:
        if variable_entry.get('instance_information'):
            zarrs.append(variable_entry)
    return zarrs

def query_zarr_stores(zarr_stores, short_name, version, variable=None):
    # Filter zarr store records whose native_id matches short_name, version,
    # and (optionally) a variable name.
    zarrs = []
    if variable:
        pattern = short_name + '.*' + version + '.*' + variable + '.*'
    else:
        pattern = short_name + '.*' + version + '.*'

    for store in zarr_stores:
        try:
            if re.match(pattern, store["native_id"]):
                zarrs.append(store)
        except KeyError:
            continue
    return zarrs


# one way to search, knowing only the provider
provider = "GES_DISC"
zarr_stores = get_all_zarr_stores(provider=provider)

# another way (perhaps more intuitive), where the user supplies short_name, version, and variable
short_name = "GPM_3IMERGHH"
version = "06"
variable = "precipitationCal"
zarr_stores = query_zarr_stores(zarr_stores, short_name, version, variable)

Any thoughts/feedback before I suggest a PR to add instance_information as an additional query parameter in VariableQuery?
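As a starting point for discussion, the addition could look something like this (the method name instance_format and the use of self.params are my assumptions; it simply maps onto the instance-format search parameter shown in the URL above):

from cmr import VariableQuery

class ZarrAwareVariableQuery(VariableQuery):
    def instance_format(self, fmt):
        # Filter variables by their instance information format, e.g. "zarr".
        self.params["instance-format"] = fmt
        return self

# Usage: find all GES_DISC variables that are published as zarr stores.
zarr_vars = ZarrAwareVariableQuery().provider("GES_DISC").instance_format("zarr").get_all()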

Search on revision_date

Add option to search for granules by revision_date (a usage sketch follows the checklist below).

  • Add revision_date function to Query class (need to verify revision_date is not specific to only some concepts, I think it applies to all CMR concepts)
  • Add test to verify documents are included/excluded based on their revision_date
  • Add example usage to README
  • Update CHANGELOG
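A minimal usage sketch for the README (assuming a revision_date(date_from, date_to) signature; adjust to whatever the final API looks like):

from cmr import GranuleQuery

api = GranuleQuery()
granules = (
    api.short_name("MOD02QKM")
    .revision_date("2023-01-01", "2023-06-30")
    .get(10)
)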

Feature: Check for correct WKT coordinate ordering

I was using earthaccess to search for some data, and I noticed that when I accidentally flipped the left/right coordinates in the bounding box, no error was raised; instead, the search formed a polygon stretching ~179 degrees around the Earth in the other direction.

The asf_search library raises a warning about your WKT (and also repairs the winding direction) when this happens, which is how I caught the mistake. Perhaps raising a warning that you may have entered the wrong coordinates during a search could be helpful?
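Something along these lines might be enough (a hypothetical helper, not existing python_cmr code; the parameter names mirror the bounding_box argument order):

import warnings

def check_bounding_box(lower_left_lon, lower_left_lat, upper_right_lon, upper_right_lat):
    # Warn when the box looks flipped rather than silently wrapping around the Earth.
    if lower_left_lon > upper_right_lon:
        warnings.warn(
            "West longitude is greater than east longitude; this bounding box wraps "
            "around the antimeridian. Did you swap the left/right coordinates?"
        )
    if lower_left_lat > upper_right_lat:
        warnings.warn("South latitude is greater than north latitude; did you swap the coordinates?")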

add CMR graphql functionalities

NASA has created a GraphQL interface for accessing CMR: https://github.com/nasa/cmr-graphql

Some of the advantages of querying this way are supposed to be 1) performance and 2) the ability to search across different schemas, which can be a huge pain point for users of the regular CMR API. Internally we are beginning to pass code around to demonstrate how to call CMR-GraphQL pythonically, and I have been wondering whether those of us maintaining python_cmr should invest time in bringing GraphQL querying in as an option within python_cmr.
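For reference, calling CMR-GraphQL from Python is already straightforward with plain requests; a rough illustration (the endpoint URL and the field names collections/items/conceptId/title are my recollection of the cmr-graphql schema and should be checked against its documentation):

import requests

GRAPHQL_URL = "https://graphql.earthdata.nasa.gov/api"  # assumed production endpoint

query = """
{
  collections(shortName: "M2T1NXSLV", limit: 5) {
    items {
      conceptId
      title
    }
  }
}
"""

response = requests.post(GRAPHQL_URL, json={"query": query})
response.raise_for_status()
print(response.json()["data"]["collections"]["items"])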

improve ISO 8601 parsing for temporal filter

The earthaccess package has had a couple of issues raised whose resolutions could be enhancements for the python-cmr package.

I started to implement handling of 1) timezone-aware datetime objects and 2) ISO-8601 strings that are not specified down to the second over in the earthaccess package. Some of the earthaccess community (@mfisher87, @jhkennedy) suggested we offer our resolution upstream. I have a draft PR (to be linked below) and would appreciate feedback here from both projects.
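To sketch the idea (my own illustration of the approach, not the draft PR itself): normalize whatever the user passes into the UTC YYYY-MM-DDTHH:MM:SSZ form that CMR's temporal parameter expects.

from datetime import datetime, timezone
from typing import Union

def to_cmr_datetime(value: Union[str, datetime]) -> str:
    # Accept a datetime (naive or timezone-aware) or an ISO-8601 string that may omit
    # the time or seconds, e.g. "2021-02-01" or "2021-02-01T12:30".
    if isinstance(value, str):
        value = datetime.fromisoformat(value)
    if value.tzinfo is None:
        value = value.replace(tzinfo=timezone.utc)  # treat naive values as UTC
    return value.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")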

Provide convenience method for adding a client-id header

The Headers section of the CMR Search API documentation describes the Client-Id header as follows:

Client-Id - Indicates a name for the client using the CMR API. Specifying this helps Operations monitor query performance per client. It can also make it easier for them to identify your requests if you contact them for assistance.

Given that there are token and bearer_token methods in the Query class for conveniently setting the Authorization header, providing a client_id method would be convenient for supplying the Client-Id header.

Further, I recommend the following (a sketch follows the list):

  • When the user does not invoke the client_id method to specify their own chosen ID for identification for assistance, a default should be set. I recommend something like this: python_cmr-vX.Y.Z, where X.Y.Z is the version of python_cmr
  • When the user does invoke the client_id method, the following suffix should be added to the value specified by the user: (python_cmr-vX.Y.Z), with a space character between the user-supplied value and the suffix.
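A possible shape for the method (a sketch only; it assumes the Query class keeps a headers dict, as the token methods suggest, and reads the installed version via importlib.metadata):

from importlib.metadata import version

def client_id(self, client_id=None):
    # Set the Client-Id header; always include the python_cmr version as a suffix.
    suffix = f"python_cmr-v{version('python-cmr')}"
    self.headers["Client-Id"] = suffix if client_id is None else f"{client_id} ({suffix})"
    return self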

Modernize Python support

Many versions of Python have been released (and many have become unsupported) since this repo was first created, and there are at least a couple of places where I've noticed either an unspecified minimum required Python version, or code that is meant to accommodate some very old Python versions.

For example, the classifiers block in pyproject.toml does not list any specific Python versions. I suggest this block be updated appropriately, with a minimum of no less than Python 3.8. In fact, the GitHub Actions workflows use only 3.10 or later, so these should likely also be enhanced to run a matrix of tests across all versions of Python this library is intended to work with (perhaps 3.8 through 3.12, with 3.13 to follow soon).

One place in the code accommodates a very old package structure for urllib functionality and should be simplified/unified. There may be other places that should be modernized, but maybe not; the codebase is relatively small, and I haven't looked for other spots yet.
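For illustration, the kind of compatibility shim referred to above (the exact code in python_cmr may differ) and its modern replacement:

# Python 2/3 compatibility shim of the kind that can now be dropped:
try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote  # Python 2

# With Python 3.8+ as the minimum, a single import suffices:
from urllib.parse import quote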

Enable STAC Output

CMR supports STAC result format for collection/granule retrieval and granule searches: https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html#stac

But currently python_cmr doesn't support it:

_valid_formats_regex = [
    "json", "xml", "echo10", "iso", "iso19115",
    "csv", "atom", "kml", "native"
]

It would be great to support this output format! I made some small changes on a new branch to allow this (develop...scottyhq:python_cmr:stac); let me know if you're open to a pull request.

Basic usage:

api = GranuleQuery() 
search = api.parameters(
    point=(-105.78, 35.79),
    temporal=('2021-02-01','2021-03-01'),
    collection_concept_id='C2021957657-LPCLOUD'  # Required for STAC search
)
items = search.format("stac").get()

Collecting changes for release 0.11.0

Using this issue as a collection point for stakeholders to decide when we should release the next version of python_cmr.

If you are listed as an assignee (or pinged directly, @chuckwondo), I'd like your concurrence on the following:

  1. The changes below constitute a minor release (meaning the next release will be 0.11.0)
  2. Whether a release should be created with just these changes (Yes: the release can proceed; No: we should wait for some additional feature/issue before creating a release)
  3. You are willing to verify the release prior to merging to main

If you think there should be more stakeholders in this discussion, please add them to the list of assignees.

Currently, we have the following changes pending for the next release:

Changed

  • issues/35 Eliminated
    accommodation for Python versions older than 3.8 and updated CI build to test
    against Python versions 3.8 through 3.12. Also, fixed all flake8 warnings.

Added

Fixed

  • issues/42 Fixed bug where a
    KeyError was thrown from Query.get when the query format was a supported
    format other than "json". Further, in such cases, too many items would be
    fetched from the CMR due to a bug in how items were counted. Now, no more
    than limit items are fetched.

KeyError: 'feed' when using format other than 'json'

When performing a query with a format other than "json", a KeyError is raised.

$ python -c "import cmr; cmr.queries.GranuleQuery().format('umm_json').short_name('MOD02QKM').get(1)"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File ".../python_cmr/cmr/queries.py", line 73, in get
    if page_size > len(response.json()['feed']['entry']) or len(results) >= limit:
                       ~~~~~~~~~~~~~~~^^^^^^^^
KeyError: 'feed'
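For context, the JSON and UMM-JSON formats wrap their results differently, which is why unconditionally indexing into ['feed']['entry'] fails; a format-aware accessor might look roughly like this (illustration only, not necessarily how the fix for issues/42 above was implemented):

def entries_from(response_json, response_format):
    # Atom-style JSON nests results under feed/entry; UMM-JSON uses a top-level "items" list.
    if response_format == "json":
        return response_json["feed"]["entry"]
    if response_format == "umm_json":
        return response_json["items"]
    raise ValueError(f"cannot count entries for format {response_format!r}")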
