opendatacube / datacube-core

488 stars · 54 watchers · 175 forks · 46.73 MB

Open Data Cube analyses continental scale Earth Observation data through time

Home Page: http://www.opendatacube.org

License: Apache License 2.0

Python 99.61% Shell 0.21% Dockerfile 0.12% Makefile 0.06%
python gis scientific-computing remote-sensing netcdf numpy raster gdal hacktoberfest

datacube-core's People

Contributors

alex-ip, alexgleith, andrewdhicks, ariana-b, awalshie, bellemae, benjimin, dependabot[bot], gypsybojangles, harshurampur, jeremyh, kirill888, mpaget, omad, petewa, pindge, pre-commit-ci[bot], richardscottoz, robbibt, rowanwins, rtaib, simonaoliver, snowman2, spacemanpaul, spaxe, uchchwhash, v0lat1le, whatnick, woodcockr, zhang01ga

datacube-core's Issues

Specify Resampling Algorithm when Ingesting

The resampling method for the ingester should support the following:

  • near: nearest neighbour resampling (default, fastest algorithm, worst interpolation quality).
  • bilinear: bilinear resampling.
  • cubic: cubic resampling.
  • cubicspline: cubic spline resampling.

At the moment the storage_config.yaml file used to specify the output of an ingestion process has a placeholder for interpolation, but it is currently being ignored by the ingester.

This will be replaced with an option to specify a resampling algorithm, as implemented by GDAL. The allowed options are based on http://www.gdal.org/gdalwarper_8h.html#a4775b029869df1f9270ad554c0633843.
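
A minimal sketch of how a configured method name could map onto GDAL's warp resampling constants (the mapping and the resolve_resampling helper are illustrative, not the ingester's actual code):

    from osgeo import gdal

    # Illustrative mapping from the configured method name to GDAL's warp
    # resampling constants; the real option name and plumbing may differ.
    RESAMPLING_METHODS = {
        'near': gdal.GRA_NearestNeighbour,
        'bilinear': gdal.GRA_Bilinear,
        'cubic': gdal.GRA_Cubic,
        'cubicspline': gdal.GRA_CubicSpline,
    }

    def resolve_resampling(name='near'):
        """Return the GDAL resampling constant for a configured method name."""
        try:
            return RESAMPLING_METHODS[name]
        except KeyError:
            raise ValueError('Unsupported resampling method: %r' % name)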

Enable per pixel metadata/provenance tracking

To enable provenance tracking at the per-pixel level, investigate methods that may enable tracing of pixel provenance where the contents of a storage unit have mixed provenance, e.g. Landsat scene overlap.
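
One possible layout, purely as a sketch: a per-pixel provenance band holding an index into the storage unit's list of source dataset IDs (all names, IDs and shapes below are made up):

    import numpy as np

    # Purely illustrative: a per-pixel provenance band holding a small integer
    # index into the storage unit's list of source dataset IDs, so overlapping
    # Landsat scenes can be traced back from individual pixels.
    source_dataset_ids = ['scene-A-id', 'scene-B-id']      # hypothetical IDs
    provenance = np.zeros((4000, 4000), dtype='uint8')     # 0 -> scene A everywhere
    provenance[:, 2000:] = 1                               # 1 -> scene B in the overlap

    y, x = 100, 2500
    print(source_dataset_ids[provenance[y, x]])            # 'scene-B-id'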

_gdfnetcdf.py - NetCDF - Set dimension issue

Dear Team,

I found an issue in the 'set_dimension' function in _gdfnetcdf.py:
I replaced

            dimension_index_vector = np.around(np.arange(dimension_min, dimension_max, element_size), self.decimal_places)

by

            dimension_index_vector = np.around(np.linspace(dimension_min, dimension_max, dimension_config['dimension_elements']), self.decimal_places)

because when using a non-integer step, such as 0.25, the result is often not consistent: the number of elements doesn't match the expected one, e.g. 1001 instead of 1000.

It is better to use linspace for these cases.
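
A quick illustration of the arange pitfall with a floating-point step, and why linspace avoids it:

    import numpy as np

    # With a floating-point step, arange can return one element more than
    # expected because the stop value is compared after rounding error.
    print(len(np.arange(0.0, 0.9, 0.3)))   # 4, not the expected 3
    print(np.arange(0.0, 0.9, 0.3))        # last element is ~0.8999999999999999

    # linspace takes the element count explicitly, so the length is exact.
    print(len(np.linspace(0.0, 0.9, 3)))   # 3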

Kind regards,

Didier

Storage unit temporal aggregator tool

Aggregate all timeslices/storage units to single storage unit

  1. Query database
  2. Return single aggregated unit
  3. Update DB

Operationally this will be triggered by an operator via the command line:
Assumes the component timeslices exist.
Fixed time range: 1 year for Landsat.
Input time slices that are aggregated will be appropriately tagged so they can be excluded from analysis / processing.

ingest data of >4 dimensions

The storage write code currently hard-codes its dimensions.
Confirm that a storage unit with higher-order dimensionality can be ingested.
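
A minimal check, assuming plain netCDF4 and made-up dimension names, that a file with five dimensions can be written:

    import numpy as np
    from netCDF4 import Dataset

    # Illustrative 5-D storage unit: time, level, polarisation, y, x.
    with Dataset('five_dim_test.nc', 'w') as nc:
        for name, size in [('time', 2), ('level', 3), ('polarisation', 2),
                           ('y', 100), ('x', 100)]:
            nc.createDimension(name, size)
        var = nc.createVariable('measurement', 'f4',
                                ('time', 'level', 'polarisation', 'y', 'x'),
                                zlib=True)
        var[:] = np.zeros((2, 3, 2, 100, 100), dtype='float32')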

Ability to load storage and ingest configuration via command-line

As discussed with Alex, most of our configuration handling (such as dataset types, tile/storage types) could be simpler if stored as JSONB documents directly, rather than split across many tables. We've toyed with this in the doc db prototype.

We could allow the user to configure the AGDC (such as adding a new product/dataset type) by specifying json/yaml config documents directly from the command-line, rather than require direct database editing (as with AGDCv1).
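
A rough sketch of what such a command could look like; the argument name and the echo-only behaviour are placeholders, not the real tool:

    import argparse
    import json
    import yaml   # PyYAML

    # Hypothetical helper: read a yaml/json product definition from the command
    # line and hand the parsed document to the index as a JSONB document.
    def main():
        parser = argparse.ArgumentParser(description='Add a product/dataset type')
        parser.add_argument('config', help='Path to a yaml or json config document')
        args = parser.parse_args()

        with open(args.config) as f:
            doc = yaml.safe_load(f)          # safe_load also parses JSON

        # In the real tool this would be inserted into the index database;
        # here we just echo the document that would be stored.
        print(json.dumps(doc, indent=2))

    if __name__ == '__main__':
        main()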

Search API for minimal implementation

Search for Storage Units using criteria specified in #11

Search results must include the following Storage Unit information:

  • URI
  • projection
  • dimensions (assume cf conventions + labels)
  • variables/measurements
  • coordinate extents
  • number of measurements along each dimension (coordinate length?)
  • DatasetID (provenance tracking)
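
One possible shape for a single search result, with keys mirroring the list above and entirely made-up values:

    # Illustrative shape of a single search result; the keys mirror the list
    # above and are not the agreed API.
    storage_unit_result = {
        'uri': 'file:///g/data/example/LS5_TM_NBAR_example.nc',
        'projection': 'EPSG:3577',
        'dimensions': ('time', 'y', 'x'),               # CF conventions + labels
        'variables': ('band_1', 'band_2', 'band_3'),    # measurements
        'coordinate_extents': {'time': ('2010-01-01', '2010-12-31'),
                               'y': (-4000000.0, -3900000.0),
                               'x': (1500000.0, 1600000.0)},
        'coordinate_lengths': {'time': 23, 'y': 4000, 'x': 4000},
        'dataset_ids': ('example-dataset-id',),         # provenance tracking
    }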

_gdfnetcdf.py - Invalid GDF.DECIMAL_PLACES

Dear Team,

In _gdfnetcdf.py, the instruction at line 155 fails.

I think GDF.DECIMAL_PLACES must be replaced with self.decimal_places.

Replace:
dimension_index_vector = np.around(np.arange(dimension_min, dimension_max, element_size), GDF.DECIMAL_PLACES)

by:
dimension_index_vector = np.around(np.arange(dimension_min, dimension_max, element_size), self.decimal_places)

Kind regards,

Determine reasonable storage compression for storage units

Investigate available compression algorithms and specify a compression rate to optimise data access performance.

Produce compression ratio versus access speed figures for different algorithm parameters and NaN percentages; ignore storage file size as a constraint for now. Use image data, not randomly generated data.
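
A rough benchmarking sketch using netCDF4's built-in zlib deflate levels; the array, chunk sizes and timing approach are illustrative only:

    import os
    import time

    import numpy as np
    from netCDF4 import Dataset

    # Stand-in array only; the real investigation should use image data with
    # representative NaN percentages, not randomly generated values.
    data = np.random.rand(10, 1000, 1000).astype('float32')

    for complevel in (1, 4, 9):
        path = 'compress_test_%d.nc' % complevel
        with Dataset(path, 'w') as nc:
            nc.createDimension('time', data.shape[0])
            nc.createDimension('y', data.shape[1])
            nc.createDimension('x', data.shape[2])
            var = nc.createVariable('band', 'f4', ('time', 'y', 'x'),
                                    zlib=True, complevel=complevel,
                                    chunksizes=(1, 500, 500))
            var[:] = data

        start = time.time()
        with Dataset(path) as nc:
            _ = nc.variables['band'][0, :, :]          # read one timeslice back
        print(complevel, os.path.getsize(path), time.time() - start)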

Data Access API interface as per GDF interface.

Interface between the Data Access API and the AE/EE.
As per GDF with the following modifications:

get_descriptor:
storage_units has storage_max, storage_min and storage_shape. storage_path is to be added.

get_data:
the returned numpy data arrays are now xray.DataArrays.
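
An illustrative fragment of what the modified get_descriptor entry for one storage unit might contain; the keys follow the description above and the values are invented:

    # Illustrative get_descriptor fragment for a single storage unit; only the
    # keys named above are shown and the values are made up.
    descriptor_entry = {
        'storage_min': (1293840000, -4000000.0, 1500000.0),   # (time, y, x) minima
        'storage_max': (1325376000, -3900000.0, 1600000.0),   # (time, y, x) maxima
        'storage_shape': (23, 4000, 4000),                    # elements per dimension
        'storage_path': '/g/data/example/LS5_TM_NBAR_example.nc',  # to be added
    }

    # get_data now returns labelled arrays rather than bare numpy arrays, e.g.
    #   result['arrays']['band_1']  ->  xray.DataArray with time/y/x coordinates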

enable spatial aggregation of storage units

The Storage Unit access method returns an out-of-tile nD array, or an object that wraps one.
The get_descriptor & get_data requirements are fulfilled.
Further documentation is required to explicitly define the requirement.

Travis-CI integration is broken - agdc-v2 repo is not showing up in travis and cannot be enabled

The transfer of the agdc v2 repository from the agdc-research-trial organisation to the data-cube organisation has broken travis-ci.org integration. The newly relocated repository doesn't appear on the profile page where Travis configures the GitHub organisation, and pressing "Sync" to update the profile with GitHub fails to change this.

I've emailed the travis-ci contact address in an effort to find out what has gone wrong. Judging by a Google search, this happens from time to time.

subset from input dataset at ingest

For cases where the storage unit only needs to be populated for a subset extent of the input data.

Example: Himawari 8 datasets have data for the entire globe, but we might only want to ingest and store data covering Australia.
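
A minimal sketch of the idea using xarray (xray's successor); the input file, coordinate names and lat/lon bounds are illustrative:

    import xarray as xr

    # Hypothetical full-globe input with ascending longitude and descending
    # latitude coordinates.
    src = xr.open_dataset('himawari8_full_disk_example.nc')

    # Illustrative bounding box covering Australia; only this subset would be
    # ingested and stored.
    subset = src.sel(longitude=slice(110.0, 155.0),
                     latitude=slice(-10.0, -45.0))   # descending latitude axis
    subset.to_netcdf('himawari8_australia_subset.nc')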

Group-based permissions & finer-grained authentication

  • Use of individual user accounts rather than a single shared user/password.
  • But users should still be able to "module load" the api without any further configuration (editing conf files).
    • AGDCv1 did this via a shared user and password hard-coded into the module, but this isn't ideal. It would be preferable to use existing environmental user accounts instead.
      • Look at PAM, LDAP, ident usage within NCI? All are built into Postgres and easy to configure.
  • Postgres grants should all be to group roles, not user roles.

This will allow for many useful features:

  • Logging of per-user actions: such as who ingested a dataset
  • More fine-grained access control (who can ingest, who can administer, who can query).
  • Minimise password management by using existing systems.

Add ability to configure 'locations'

Locations are named URI 'base paths'. For example, a gdata location can 'point to' file:///g/data. For now only file-system locations are required; in the future, web and S3 locations could be supported.

see 59c1fbe for a potential config file solution
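
A sketch of how named locations might be resolved in code; the mapping and helper below are illustrative and not the config format in 59c1fbe:

    # Illustrative only: named location 'base paths' and a resolver that turns
    # a location-relative path into a full URI.
    LOCATIONS = {
        'gdata': 'file:///g/data',
        # 'web': 'https://example.org/datacube',   # possible future location types
        # 's3':  's3://example-bucket/datacube',
    }

    def resolve(location, relative_path):
        """Join a named location's base URI with a relative path."""
        base = LOCATIONS[location].rstrip('/')
        return '%s/%s' % (base, relative_path.lstrip('/'))

    print(resolve('gdata', 'rs0/scenes/example_scene.nc'))
    # file:///g/data/rs0/scenes/example_scene.nc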

Ingest Landsat Datasets

Tie together the following functionality

  • Index the dataset (see #10 )
  • Generate Storage Units (see #9 )
  • Index Storage Units (see #11 )

to ingest Geoscience Australia (in the first instance) LS 5, 7 and 8 L1T, PQ, NBAR and FC packaged (eo-datasets) products into the specified storage format on demand.

Enable query across multiple AGDC database instances of equal version

For common versions of datacube databases on common infrastructure - enable query and data access using more than one database/datastore.

User story: user has a local datacube implementation (datacube 1) but wants to use data from a public instance (datacube 2) in the query.

Aggregate tiles to multi-time storage units

Combine multiple storage units into one by stacking the data along a specified dimension (see the sketch after the list below).

  • Input storage units must not be modified
  • Coordinates along the stacked dimension must be sorted in the combined SU
  • Fail if the input storage units do not align perfectly, i.e. do not pad data with NDVs
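
A minimal sketch of the stacking step using xarray (xray's successor); the file names, alignment check and concat arguments are illustrative:

    import xarray as xr

    # Hypothetical single-time storage units to be combined along 'time'.
    paths = ['su_2010_01.nc', 'su_2010_02.nc', 'su_2010_03.nc']
    units = [xr.open_dataset(p) for p in paths]            # inputs are not modified

    # Fail loudly if the spatial grids do not align perfectly (no NDV padding).
    first = units[0]
    for other in units[1:]:
        if not (first['x'].equals(other['x']) and first['y'].equals(other['y'])):
            raise ValueError('Input storage units do not align; refusing to pad')

    combined = xr.concat(units, dim='time').sortby('time')  # sorted along the stacked dim
    combined.to_netcdf('su_2010_q1_combined.nc')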

_gdfnetcdf.py - NetCDF - SetAttributes failed

Dear Team,

In _gdfnetcdf.py, the function georeference_from_file fails because of an invalid (?) attribute name 'name' in crs_metadata:

...
crs_metadata = {'crs:name': spatial_reference.GetAttrValue('geogcs'),
                'crs:longitude_of_prime_meridian': 0.0,  # TODO: This needs to be fixed!!! An OSR object should have this, but maybe only for specific OSR references??
                'crs:inverse_flattening': spatial_reference.GetInvFlattening(),
                'crs:semi_major_axis': spatial_reference.GetSemiMajor(),
                'crs:semi_minor_axis': spatial_reference.GetSemiMinor(),
                }
self.set_variable('crs', dims=(), dtype='i4')
self.set_attributes(crs_metadata)
...

Exception raised while processing storage unit (2015, -28, 111): 'name' is one of the reserved attributes ('_grpid', '_grp', '_varid', 'groups', 'dimensions', 'variables', 'dtype', 'data_model', 'disk_format', '_nunlimdim', 'path', 'parent', 'ndim', 'mask', 'scale', 'cmptypes', 'vltypes', '_isprimitive', 'file_format', '_isvlen', '_iscompound', '_cmptype', '_vltype', 'name', 'orthogoral_indexing', 'keepweakref'), cannot rebind. Use setncattr instead.

I replaced
crs_metadata = {'crs:name': spatial_reference.GetAttrValue('geogcs'),
by
crs_metadata = {'crs:standard_name': spatial_reference.GetAttrValue('geogcs'),
or
crs_metadata = {'crs:long_name': spatial_reference.GetAttrValue('geogcs'),

as a workaround
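
The underlying netCDF4 behaviour, and the fix the error message itself suggests, can be shown in a small standalone example (not the _gdfnetcdf.py code):

    from netCDF4 import Dataset

    with Dataset('crs_attr_test.nc', 'w') as nc:
        crs = nc.createVariable('crs', 'i4', ())

        # crs.name = 'GDA94'                  # fails: 'name' is a reserved attribute
        crs.setncattr('name', 'GDA94')        # works: bypasses the reserved list
        crs.setncattr('semi_major_axis', 6378137.0)

        print(crs.getncattr('name'))          # 'GDA94'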

Kind regards,

Add collection metadata to storage units

When setting up a datacube collection, we need to record collection level metadata like:

  • title: Experimental Data files From the Australian Geoscience Data Cube v2 Development - DO NOT USE
  • summary: These files are experimental, short lived, and the format will change.
  • source: This data is a reprojection and retile of Landsat surface reflectance scene data available from /g/data/rs0/scenes/
  • product_version: 0.0.0
  • license: Creative Commons Attribution 4.0 International CC BY 4.0

These data need to be loaded from a configuration file into the database and made available to the storage unit writer. The NetCDF files we are writing now won't pass NCI validation without these pieces of metadata.

At the moment this sort of data is in the mapping documents, but probably doesn't belong there. Where should it go and what do we need to get it passed around?
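
For reference, a minimal sketch of a writer attaching these fields as NetCDF global attributes; the file name and write mode are illustrative, and in practice the attributes would be set on the storage unit being written:

    from netCDF4 import Dataset

    # Illustrative collection-level metadata, e.g. loaded from a config file.
    collection_metadata = {
        'title': 'Experimental Data files From the Australian Geoscience Data Cube v2 Development - DO NOT USE',
        'summary': 'These files are experimental, short lived, and the format will change.',
        'source': 'Reprojection and retile of Landsat surface reflectance scene data from /g/data/rs0/scenes/',
        'product_version': '0.0.0',
        'license': 'Creative Commons Attribution 4.0 International CC BY 4.0',
    }

    with Dataset('storage_unit_example.nc', 'w') as nc:
        # Written as global attributes so the file carries the collection metadata.
        for key, value in collection_metadata.items():
            nc.setncattr(key, value)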

Storage Access API - retrieve data from minimal build

Use results returned by the Search API (#13) and provide:

  • construct analysis array elements from storage units
  • labeled data (with xray)
  • lazy loading to facilitate out-of-core processing (with dask, for example)
  • group data by inter-operability
    • same projection
    • same coordinate extents
    • same 'resolution'
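
A sketch of the lazy-loading part using xarray (xray's successor) with dask installed; the file pattern, chunk sizes and the analysis itself are illustrative:

    import xarray as xr

    # Open several storage units lazily; dask chunking means data are only read
    # when an analysis actually computes on them (out-of-core processing).
    ds = xr.open_mfdataset('LS5_TM_NBAR_example_*.nc',
                           chunks={'time': 1, 'y': 1000, 'x': 1000})

    print(ds['band_1'])          # a labelled, lazily evaluated DataArray
    subset = ds['band_1'].sel(x=slice(1500000, 1550000), y=slice(-3950000, -3900000))
    mean = subset.mean(dim='time').compute()   # data are read and reduced here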
