linz / geostore
Central storage, management and access for important geospatial datasets
License: MIT License
This should speed up the CI pipelines at least somewhat, and avoid hitting the fairly unreliable npm.org on every CI run.
In order to know where my data came from and find upstream data I may be interested in, as a Data Maintainer, I want to get the lineage, or linkages, between the datasets.
E.g. in hydro things like:
raw hydro survey data -> thick point clouds -> bathy grids
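As a hedged illustration only (not how the Data Lake records lineage today), STAC can express such a chain with derived_from links, so the whole chain can be walked backwards; all ids and hrefs below are placeholders:

# Hypothetical STAC items (trimmed to the relevant fields): each step links to
# its upstream input via STAC's "derived_from" rel type.
bathy_grid_item = {
    "type": "Feature",
    "id": "bathy-grid-0001",
    "links": [{"rel": "derived_from", "href": "./thick-point-cloud-0001.json"}],
}
point_cloud_item = {
    "type": "Feature",
    "id": "thick-point-cloud-0001",
    "links": [{"rel": "derived_from", "href": "./raw-hydro-survey-0001.json"}],
}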
We currently run everything on "ubuntu-latest". If that image changes at an impractical time, such as at release, that could cause CI blockage for a while. We should probably use a fixed Ubuntu version (for example, runs-on: ubuntu-20.04 instead of runs-on: ubuntu-latest) and document how and when to upgrade it.
We've already encountered some issues with Python 3.6 (cattrs no longer supporting it and less support for type annotations), and we have no reason to support multiple Python versions in production.
To do:
Update .python-version.
Use the PYTHON_3_8 runtime for AWS Lambda jobs.
Change pyproject.toml to ^3.8,<3.9.
So that I don't accidentally lose my important data, as a Data Maintainer, I want to only delete datasets with no data in them.
Note: datasets can be altered in every way, so there should never be a need to delete them. We may implement an 'archive' feature later if needed.
There are organisations outside of LINZ that are interested in what we are doing. We should make the repo public as soon as possible. I think it's fine to do this while it's a work in progress as long as we indicate that somehow.
Add a database table for storing dataset validation results, and enable the required read and read/write access to it for the Dataset Version creation process resources.
For CD via GitHub Actions to deploy the Data Lake infrastructure, AWS service accounts for AWS LI Datalake NonProd and AWS LI Datalake Prod are required.
Add more storage space for AWS Batch containers, required when working with larger data files.
In order to share my metadata with users via the LINZ Data Service, as a Data Maintainer, I want to convert the metadata I have already recorded in the Data Lake (STAC standard) to ISO 19115/19139 standard.
LINZ has a form to fill out to indicate whether a full privacy impact assessment is required.
As part of open sourcing the project, issues (all = open & closed) need to be reviewed and have any sensitive information redacted.
Implement Dataset Space Lambda handler function
botocore.response.StreamingBody.iter_chunks has a default chunk size of 1024 bytes. This may not be optimal for the file sizes we have to deal with. We should check whether different chunk sizes make a big difference to file processing time. Test process suggestion:
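For example, a hedged sketch of such a test, timing a few chunk sizes against a single object (the bucket and key are placeholders; the loop body stands in for our real per-chunk processing):

# Compare how long it takes to consume one S3 object at various chunk sizes.
import time

import boto3

s3 = boto3.client("s3")

for chunk_size in (1024, 64 * 1024, 1024 * 1024):
    body = s3.get_object(Bucket="example-bucket", Key="example-file.tif")["Body"]
    start = time.monotonic()
    for chunk in body.iter_chunks(chunk_size=chunk_size):
        pass  # replace with real per-chunk processing, e.g. hashing
    print(f"{chunk_size} B chunks: {time.monotonic() - start:.2f} s")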
Implement Dataset Version State Machine error handling
Use CloudFormation-generated unique AWS resource names to avoid resource name clashes when multiple copies of stacks are created in the same AWS account.
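A minimal sketch, assuming CDK v1 Python as used elsewhere in this repo: omit the physical name and CloudFormation generates a unique one.

from aws_cdk import aws_s3, core


class StorageStack(core.Stack):
    def __init__(self, scope: core.Construct, stack_id: str) -> None:
        super().__init__(scope, stack_id)
        # No "bucket_name" argument, so CloudFormation generates a unique
        # physical name and multiple stack copies can coexist in one account.
        aws_s3.Bucket(self, "storage-bucket")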
core.Tag.add(app, "CostCentre", "1050")
core.Tag.add(app, "ApplicationName", "geospatial-data-lake")
core.Tag.add(app, "Owner", "Bill M. Nelson")
core.Tag.add(self, "EnvironmentType", env)
Deploy the master branch to the AWS nonprod account; on a release-x.y branch, deploy to the AWS prod account.
See Caching dependencies to speed up workflows and pip examples, bearing in mind some complexities: we need to cache the production (pip install --no-dev) and full (pip install) dependencies separately. Both of these should be doable using PIP_CACHE_DIR.
I've been calling the product 'LINZ Geospatial Data Lake' to make it clear it's not for non-geo data.
I think we should consider renaming this repo to encapsulate what it does.
Options could be linz-geospatial-data-lake or linz-geo-data-lake? Alternatively we could come up with a completely new name for it?
Should also remove the '3' at some point and rename https://github.com/linz/linz-data-lake to something that indicates it has been replaced.
As part of the Store Topo Historic Imagery data epic, S3 data storage is to be delivered.
Relevant software documentation must also be delivered.
What degree of detail is required?
cdk deploy
So that I can ensure my metadata has all the LINZ required metadata elements, as a Data Maintainer, I want to validate the LINZ top-level metadata extension.
LINZ metadata extension profile
As part of open sourcing the geospatial data-lake repo we must first ensure there is no sensitive information within the repository.
Implement GET request response paging
Create Data Lake Github team
The Lambda bundling script is not running in Docker anymore, at least on my machine.
@SPlanzer commented on Thu Sep 17 2020
In order to build CD, the CDK project must be initialised.
Input: STAC asset with checksum.
Output:
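A minimal sketch of what this check might look like, assuming a plain SHA-256 hex digest (STAC's file extension actually encodes checksums as multihash, so a real implementation would decode that prefix first):

import hashlib


def asset_checksum_matches(asset_path: str, expected_hex_digest: str) -> bool:
    # Stream the file in chunks so large assets don't have to fit in memory.
    digest = hashlib.sha256()
    with open(asset_path, "rb") as asset_file:
        for chunk in iter(lambda: asset_file.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex_digest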
@imincik commented on Tue Nov 10 2020
The set-env command is deprecated and will be disabled on November 16th. Please upgrade to using Environment Files. For more information see: https://github.blog/changelog/2020-10-01-github-actions-deprecating-set-env-and-add-path-commands
The fix is to replace, e.g., echo "::set-env name=FOO::bar" with echo "FOO=bar" >> $GITHUB_ENV. Example of a fix in another of our repos: https://github.com/linz/bde-processor-deployment/pull/916/files
Create simple detail design of Data Lake S3 storage
Should be able to use the STAC JSON schema directly.
Questions:
How are the catalog, collection and item schemas related?
STAC Collections share the same fields with Catalogs and therefore every Collection is also a valid Catalog.
Catalogs and collections can link to items, catalogs and collections.
Can all STAC JSON files be validated against the top-level collection schema file? If not, how do we detect whether a file should be validated against the catalog, collection or item schema?
Are we comfortable with transforming JSON into Python dicts before validating them? They can't be exactly equivalent, since JSON parsing isn't consistent. jsonschema.validate seems to take a Python list or dict, so presumably the transformation is expected to be reliable enough (see the sketch after these questions). We are also going to be using Python as the main language in this project, so only some truly gnarly JSON should be able to cause issues.
Which links do we have to follow to verify an entire dataset? Should we follow all .links[] | .href, including .rel == "self" and .rel == "parent"? Are there other links we need to follow?
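For reference, a minimal sketch of the dict-based validation discussed above, assuming the relevant STAC schema has been downloaded to a local file (both file names are placeholders):

import json

from jsonschema import validate

with open("collection-spec.json") as schema_file:
    schema = json.load(schema_file)

with open("example-collection.json") as stac_file:
    stac_object = json.load(stac_file)  # the JSON becomes a Python dict

# Raises jsonschema.exceptions.ValidationError on failure; per the notes
# below, we would let that error propagate all the way.
validate(instance=stac_object, schema=schema)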
Notes:
Let the validate error propagate all the way.
Check the date-time and uri formats.
An S3 bucket is required to store Data Lake data.
Implement AWS resources needed for Dataset Space #26
Goals:
Tasks:
The Lambda bundling script is installing many more packages than required by the requirements.txt file. The Lambda runtime contains a lot of pre-installed packages, and only those mentioned in the requirements.txt file need to be bundled with our Python function.
Cyclomatic complexity is a good proxy for detecting code which is difficult to reason about.
In an earlier Python project we started with a maximum cyclomatic complexity of four and ended up with a maximum of six after 30 months (~8 developer-years). This project should be able to stay well within that.
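An illustrative (made-up) example of the metric: cyclomatic complexity is one plus the number of decision points, so the function below scores three, well under a limit of six.

def classify_dataset(file_count: int) -> str:
    if file_count < 0:   # decision point 1
        return "invalid"
    if file_count == 0:  # decision point 2
        return "empty"
    return "non-empty"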
There is significant latency (more than 10 seconds) in AWS Batch container launches for each new container in an array.
The ECS agent is configured to cache Docker images with the following configuration, but it might not work correctly.
echo "ECS_IMAGE_PULL_BEHAVIOR=prefer-cached" >> /etc/ecs/ecs.config
More investigation is needed. Requires SSH access to the ECS instance.
The license should be changed to MIT to align with the LINZ standard.
In order to provide context for my data, as a Data Maintainer, I want to store some supplementary files with my data and I don't want these validated.
Examples:
Split "datalake" stack to storage and endpoints stacks
PR #56 introduced a separate Lambda function bundling script for processing functions. We need one bundling script for all Lambda functions for all backend code.
Tests are required to validate the state of the Data Lake environment and source code.
@imincik @billgeo it would be good to discuss test strategy.
I note that the latest CDK testing docs state:
currently, TypeScript is the only supported language for testing AWS CDK infrastructure, though
we intend to eventually make this capability available in all languages supported by the AWS CDK.
I am therefore expecting we
Implement Lambda State Machine for Dataset validation
With LGTM you can run security checks with tools like https://securitylab.github.com/tools/codeql. See what basemaps have done as an example: https://github.com/linz/basemaps/blob/master/.github/workflows/codeql-analysis.yml
Implement Dataset Version State Machine logging
logs={
"destination": log_group,
"level": stepfunctions.LogLevel.ALL
}
See https://confluence.linz.govt.nz/display/GEOD/Logging and https://docs.python-guide.org/writing/logging/
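A fuller hedged sketch of that logs configuration in CDK v1 Python (construct IDs and the Pass-state definition are placeholders, not this repo's actual resources):

from aws_cdk import aws_logs, aws_stepfunctions as stepfunctions, core


class ProcessingStack(core.Stack):
    def __init__(self, scope: core.Construct, stack_id: str) -> None:
        super().__init__(scope, stack_id)

        log_group = aws_logs.LogGroup(self, "state-machine-log-group")

        # Send all state transitions to the log group above.
        stepfunctions.StateMachine(
            self,
            "dataset-version-creation",
            definition=stepfunctions.Pass(self, "placeholder-step"),
            logs=stepfunctions.LogOptions(
                destination=log_group,
                level=stepfunctions.LogLevel.ALL,
            ),
        )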
CDK seems to create a directory layout which breaks IDE integration and Pylint validation. Just using relative imports does not work.
Originally posted by @imincik in #56 (comment)