Cache NPM downloads

This should speed up the CI pipelines at least somewhat, and avoid hitting the fairly unreliable npm.org on every CI run.

Infra: use specific CIDR block for datalake VPC

Lineage/linkages between datasets

User Story

In order to know where my data came from and find upstream data I may be interested in, as a Data Maintainer, I want to get the lineage, or linkages, between the datasets.

Acceptance Criteria

...
...
...

Additional context

E.g. in hydro things like:

raw hydro survey data -> thick point clouds -> bathy grids

Definition of Ready

This story is ready to work on, according to the team's definition

Definition of Done

This story is done, according to the team's definition

Subtasks

Use specific Ubuntu version for CI

We currently run everything on "ubuntu-latest". If that image changes at an impractical time, such as at release, that could cause CI blockage for a while. We should probably use a fixed Ubuntu version and document how and when to upgrade it.

Use Python 3.8 across the board

We've already encountered some issues with Python 3.6 (cattrs no longer supporting it and less support for type annotations), and we have no reason to support multiple Python versions in production.

To do:

Set Python version in .python-version.
Use a single version in CI, removing the need for the strategy matrix.
Use the PYTHON_3_8 runtime for AWS Lambda jobs.
Change the version in pyproject.toml to ^3.8,<3.9.

Dataset Space: allow space deletion only if empty

User Story

So that I don't accidentally lose my important data, as Data Maintainer, I want to only delete datasets with no data in them.

Note: datasets can be altered in every way, so there should never be a need to delete them. We may implement an 'archive' feature later if needed.

Accpetance Criteria

Given a dataset with 1 or more versions, when a dataset DELETE is requested, then a message is returned and the dataset is not deleted
Given a datasets with no versions created, when a dataset DELETE is requested, then the dataset is deleted and no longer appears in the list of datasets

Tasks

[ ]

Ready

This story is ready

Done

This story is done.

Make Geospatial Data Lake repo public

There are organisations outside of LINZ that are interested in what we are doing. We should make the repo public as soon as possible. I think it's fine to do this while it's a work in progress as long as we indicate that somehow.

Tasks

LGTM
check source code for non public content
open and close tickets

Infra: add database table for storing datasets validation results

Add database table for storing datasets validation results and enable required read and read/write access to it for Dataset Version creation process resources

aws service accounts required for CD

For CD via GitHub Actions to deploy the Data-Lake infrastructure aws service accounts for AWS LI Datalake NonProd and AWS LI Datalake Prod are required

I will contact Terrence to ensure the process of getting such roles is still the same (i.e. go through service desk)
Start the process of getting these
Document these roles in confluence

Infra: add more storage space for AWS Batch containers

Add more storage space for AWS Batch containers required for data. If you want to work with larger files etc

Convert metadata to ISO 19115/19139

User Story

In order to share my metadata with users via the LINZ Data Service, as a Data Maintainer, I want to convert the metadata I have already recorded in the Data Lake (STAC standard) to ISO 19115/19139 standard.

Acceptance Criteria

Given a Data Lake dataset has valid STAC metadata, When a Data Analyst or a connected system extracts the dataset, They optionally receive a valid ISO 19115/19139 metadata XML file.
Given a LINZ Data Service data source is created for a Data Lake dataset, when the Data Analyst adds metadata from the data source, they can successfully import metadata from the data source.
Given valid required STAC metadata content, when the metadata is converted, content is converted/copied to ISO valid content.
Given valid optional STAC metadata congtent, when the metadata is converted, content is converted/copied to ISO valid content.

Additional context

Definition of Ready

This story is ready to work on, according to the team's definition

Definition of Done

This story is done, according to the team's definition

Subtasks

Privacy Threshold Assessment

LINZ has a form to fill out to indicate whether a full privacy impact assessment is required.

review issues for sensitive information

As part of open sourcing the project issues (all = open & closed) need to be review and have any sensitive information redacted.

Implement Dataset Space Lambda handler function

Optimize S3 file read chunk size

botocore.response.StreamingBody.iter_chunks has a default chunk size of 1024. This may not be optimal for the file sizes we have to deal with. We should check whether different chunk sizes make a big difference to file processing time. Test process suggestion:

Change the checksum Lambda function to take an optional chunk size parameter.
Deploy the Lambda.
Create files in S3 with representative sizes with content from /dev/random.
Write a small separate Lambda function to profile the first one (to avoid a big HTTP overhead in the timing info) using different chunk sizes on the files created in the last step.

Infra: implement Dataset Version State Machine error handling

Implement Dataset Version State Machine error handling

Infra: use CloudFormation generated unique AWS resources names

Use CloudFormation generated unique AWS resources names to avoid resource names clashes when multiple copies of stacks are created in same AWS account.

Set global tags per AWS stack

app.py

core.Tag.add(app, "CostCentre", "1050")
core.Tag.add(app, "ApplicationName", "geospatial-data-lake")
core.Tag.add(app, "Owner", "Bill M. Nelson")

data_stores/data_lake_stack.py

core.Tag.add(self, "EnvironmentType", env)

Create initial Data Lake CD

Task

deploy all changes from master branch to AWS nonprod account
when release tag is create in release-x.y branch, deploy to AWS prod account

Acceptance Criteria

automated deployment to prod and non prod AWS is tested and is working

Cache pip/Poetry downloads

See Caching dependencies to speed up workflows and pip examples, bearing in mind some complexities:

We might want to cache the requirements for different endpoints separately, to minimize the amount of copying per job.
Alternatively, we might want to cache the non-development (pip install --no-dev) and full (pip install) dependencies separately.

Both of these should be doable using PIP_CACHE_DIR.

Rename repo?

I've been calling the product 'LINZ Geospatial Data Lake' to make it clear it's not for non-geo data.
We should consider renaming this repo I think to encapsulate what it does.

Option could be linz-geospatial-data-lake or linz-geo-data-lake? Alternatively we could come up with a completely new name for it?

Should also remove the '3' at some point and rename https://github.com/linz/linz-data-lake to something that indicates it has been replaced.

Document s3 storage

As part of the Store Topo Historic Imagery data epic s3 data storage is to be delivered.

Relevant software documentation must be also delivered

Discussion required

What degree of detail is required?

Docs already include running cdk deploy
Do we require as-builts that outline names and permissions? though it can be easily argued that the cdk code is self-documenting in this sense.
What other docs do we require as part of the Store Topo Historic Imagery data documentation deliverable?

Validate the 'LINZ' top-level metadata extension

So that I can ensure my metadata has all the LINZ required metadata elements, as Data Maintainer, I want to validate the LINZ top-level metadata extension

LINZ metadata extension profile

Acceptance Criteria

LINZ top-level extension is validated as a 'required' (mandatory) extension
If the LINZ top-level extension is missing then an error is returned to the user and import is aborted
Validation errors are returned to the user
Validation rules are well-tested including negative tests (in the source STAC json schema repo?)

Tasks

Add LINZ STAC spec schema to Geostore validation
Make sure 'user friendly' error messages are returned to the user

Add script to set up development environment

Check source code for non public content

As part of open sourcing the geospatial data-lake repo we must first ensure there is no sensitive information within the repository.

Dataset Space: implement GET request response paging

Implement GET request response paging

Create Data Lake Github team

Lambda bundling script is not running in Docker anymore

Lambda bundling script is not running in Docker anymore at least on my machine.

Intialise CDK Project

@SPlanzer commented on Thu Sep 17 2020

Task

In order to build CD the CDK project must be initialised.

Acceptance Criteria

Peer review passed

Add Batch job to verify file checksum

Input: STAC asset with checksum.

Output:

Code 200 and empty body in case of success.
Code 400 and message
- "Checksum missing. See https://github.com/radiantearth/stac-spec/blob/v0.9.0/extensions/checksum/README.md."
- "Invalid checksum. See https://github.com/radiantearth/stac-spec/blob/v0.9.0/extensions/checksum/README.md."
- "File not found."
- "Checksum mismatch. Is the file corrupted?"
- "Not an S3 URL."
Code 500 and message "Unknown error. Please contact support."

CI: replace deprecated `set-env` command used in Github Actions workflow

@imincik commented on Tue Nov 10 2020

The set-env command is deprecated and will be disabled on November 16th. Please upgrade to using Environment Files. For more information see: https://github.blog/changelog/2020-10-01-github-actions-deprecating-set-env-and-add-path-commands

Example of fix in other our repo: https://github.com/linz/bde-processor-deployment/pull/916/files

Detail design of Data Lake S3 storage

Create simple detail design of Data Lake S3 storage

Verify STAC format of metadata file

Should be able to use the STAC JSON schema directly.

Questions:

How are the catalog, collection and item schemas related?

STAC Collections share the same fields with Catalogs and therefore every Collection is also a valid Catalog.

Catalogs and collections can link to items, catalogs and collections.
Can all STAC JSON files be validated against the top-level collection schema file? If not, how do we detect whether a file should be validated against the catalog, collection or item schema?
Are we comfortable with transforming JSON into Python dicts before validating them? They can't be exactly equivalent, since JSON parsing isn't consistent. jsonschema.validate seems to take a Python list or dict, so presumably the transformation is expected to be reliable enough. We are also going to be using Python as the main language in this project, so only some truly gnarly JSON should be able to cause issues.
Which links do we have to follow to verify an entire dataset? Should we follow all .links[] | .href, including .rel == "self" and .rel == "parent"? Are there other links we need to follow?

Notes:

Use a Git submodule for the STAC JSON schema repo. This avoids independently tracking their content, and makes it trivial to upgrade the schema version when we want to.
As an MVP, just let any validate error propagate all the way.
Write a function which takes an S3 URL and validates the file behind it.
Change the function to validate each link it hasn't yet visited.
Given a link structure like A → B, A → C, B → D and C → D, make sure each file is only validated once. This means we can't use a naive recursive implementation, since D would be validated after each of B and C.
Make sure to install and verify optional format validators. Includes at least date-time and uri.

Create the Datalake S3 bucket

A s3 bucket is required to store datalake data

Bucket should be deployable via CDK to both prod and nonprod environments
This bucket (for now) should be private and only accessible to those within the data-lake-prod/non-prod roles

Implement AWS resources needed for Dataset Space #26

Verify test coverage

Goals:

Make sure we don't miss untested code in PRs.
Improve overall quality.
Ensure a strong final coverage.
Encourage writing more unit tests rather than higher-level, slower and more brittle tests.

Tasks:

Calculate test coverage in CI.
Produce a report of how much coverage there is per file.
If possible, save/publish the coverage report during CI runs for ease of use.
Fail CI if coverage is below the minimum.
Make it easy to update the coverage minimum (ideally a single number in the code base), and document how to do it.

Lambda bundling script is installing much more packages than required

Lambda bundling script is installing much more packages than required by requirements.txt file. Lambda runtime contains lot of pre-installed packages and only those mentioned in requirements.txt file needs to be bundled with our Python function.

Verify code complexity

Cyclomatic complexity is a good proxy for detecting code which is difficult to reason about.

In an earlier Python project we started with a maximum cyclomatic complexity of four and ended up with a maximum of six after 30 months (~8 developer-years). This project should be able to stay well within that.

Infra: slow launch of AWS Batch containers

There is a significant latency (more than 10 seconds) in AWS Batch containers launch for each new container in array.

ECS agent is configured to cache Docker images by following configuration, but it might not work correctly.

ECS_IMAGE_PULL_BEHAVIOR=prefer-cached >> /etc/ecs/ecs.config

More investigation is needed. Requires ssh to ECS instance.

Decide on development environment (personal AWS accounts vs Localstack)

Personal AWS accounts

Pros

ready out-of-box
it is exactly the same place where prod environment will be deployed

Cons

AWS stack update deadlocks occur from time to time during development and they usually last couple of hours
slow deployment of new code (for example when developing Lambda f.)
difficult to run unit tests using Github Actions CI and AWS account
conflicts between global resource names deployed to multiple dev's accounts (S3 bucket names)
full AWS CI/CD Pipeline might be needed (CodePipeline, CodeDeploy)

Localstack running on dev's machine

Pros

nice Pytest integration
fast execution of Pytest unit tests against Localstack
nice Pytest integration with Github Actions CI
easier ways of code debugging on local machine
no infrastructure cost
in case of troubles, there is paid Localstack Pro edition - https://localstack.cloud/

Cons

Localstack's AWS emulation might not be (is not) 100% complete and troubles free

Change license to MIT

The license should be changed to MIT to align with the LINZ standard.

Store supplementary files

User Story

In order to provide context for my data, as a Data Maintainer, I want to store some supplementary files with my data and I don't want these validated.

Acceptance Criteria

...
...
...

Additional context

Examples:

Thumbnails (or a derived dataset?)
Documents such as reports, spreadsheets, specifications, data dictionaries, plans etc
Index data, such as vector data with extents of raster data tiles (or should these be a different dataset?)

Definition of Ready

This story is ready to work on, according to the team's definition

Definition of Done

This story is done, according to the team's definition

Discussion required:

@imincik @billgeo it would be good to discuss test strategy.

I note that the latest CDK testing docs state:

currently, TypeScript is the only supported language for testing AWS CDK infrastructure, though 
we intend to eventually make this capability available in all languages supported by the AWS CDK.

I am therefore expecting we

Use the PyTest framework for unit level tests that are executed locally and via CD
Have tests that run in CD after deployment to ensure the bucket is accessible and permissions are as expected

logs={
     "destination": log_group,
     "level": stepfunctions.LogLevel.ALL
 }

See https://confluence.linz.govt.nz/display/GEOD/Logging and https://docs.python-guide.org/writing/logging/

Enable Dependabot

Avoid having to disable import errors in infrastructure code

CDK seems to create a directory layout which breaks IDE integration and Pylint validation. Just using relative imports does not work.

Originally posted by @imincik in #56 (comment)

linz / geostore Goto Github PK

geostore's Issues

User Story

Acceptance Criteria

Additional context

Definition of Ready

Definition of Done

Subtasks

User Story

Accpetance Criteria

Tasks

Ready

Done

Tasks

User Story

Acceptance Criteria

Additional context

Destination:

Target

Content mapping

Definition of Ready

Definition of Done

Subtasks

Task

Acceptance Criteria

Discussion required

Acceptance Criteria

Tasks

Task

Acceptance Criteria

Personal AWS accounts

Pros

Cons

Localstack running on dev's machine

Pros

Cons

User Story

Acceptance Criteria

Additional context

Definition of Ready

Definition of Done

Subtasks

Discussion required:

Recommend Projects

Recommend Topics

Recommend Org