labsyspharm / minerva-cloud

Minerva Cloud is a novel cloud-native (AWS) platform for high-dimensional microscopy image storage, management, and visualization.

License: MIT License

Python 99.94% Dockerfile 0.06%
minerva aws cloud-computing microscopy ome immunofluorescence

minerva-cloud's Introduction

Code style: black

Minerva Cloud - AWS backend infrastructure

This repository contains the templates necessary to deploy the Minerva Cloud platform in AWS: CloudFormation templates for creating the core AWS infrastructure (S3 buckets, database, Cognito user pool, etc.) and Serverless Framework configurations for the various serverless applications.

API Documentation

Minerva API

Prerequisites

These need to be created manually in AWS console or with the AWS CLI:

  • A VPC in the desired AWS region.
  • A pair of public subnets in the VPC.
  • A pair of private subnets with NAT gateways configured in the VPC.
  • A default security group which allows communication in/out from itself.
  • A security group which allows SSH communication to EC2 instances as required.
  • A YAML configuration file listing these resources and some other properties.
  • A deployment bucket for Serverless Framework.

Black

The code is formatted using black. Formatting was applied in a single commit, so for the most useful git blame output we suggest you run

git config blame.ignoreRevsFile .git-blame-ignore-revs

See Black docs for more information.

AWS Profile

If you need to use an AWS profile other than the default one to access AWS resources, set it with:

  • export AWS_PROFILE=profile_name

Configuration File

An example configuration file, minerva-config.example.yml, is included in the repository. You need to update the VPC, subnets and other values in the configuration file to match your environment.
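
As a rough sanity check before deploying, the configuration can be loaded and validated with a short script. This is only a sketch: the key names below are assumptions for illustration; consult minerva-config.example.yml for the keys your checkout actually uses.

# check_config.py -- minimal sketch; the required key names are assumptions,
# check minerva-config.example.yml for the keys your checkout actually uses.
import sys
import yaml  # requires PyYAML

ASSUMED_KEYS = [
    "StackPrefix", "Stage", "Region",            # naming and stage settings (assumed)
    "VpcId", "SubnetsPublic", "SubnetsPrivate",  # network prerequisites (assumed)
    "DefaultSecurityGroup", "BatchAMI",          # security group and Batch AMI ID (assumed)
]

def main(path):
    with open(path) as f:
        config = yaml.safe_load(f) or {}
    missing = [key for key in ASSUMED_KEYS if key not in config]
    if missing:
        sys.exit("Missing configuration keys: " + ", ".join(missing))
    print("Configuration contains all expected keys.")

if __name__ == "__main__":
    main(sys.argv[1])

For example: python check_config.py ../../minerva-configs/test/config.yml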

Instructions

The stacks can later be updated by replacing the word "create" with "update". The instructions below assume the configuration file is in a folder named minerva-configs, which is a sibling of the minerva-cloud project root directory.

Before deploying the various serverless applications, install the required Node packages by running the following in each serverless/* directory:

npm install

  1. Deploy the common CloudFormation infrastructure
# Run in /cloudformation
python cloudformation.py create common ../../minerva-configs/test/config.yml
  2. Deploy the Cognito CloudFormation infrastructure
# Run in /cloudformation
python cloudformation.py create cognito ../../minerva-configs/test/config.yml
  3. Build the Batch AMI (Amazon Machine Image)
# Run in /ami-builder
python build.py ../../minerva-configs/test/config.yml

After the image has been created, the Batch AMI ID must be added to config.yml.

  4. Deploy the Batch CloudFormation infrastructure
# Run in /cloudformation
python cloudformation.py create batch ../../minerva-configs/test/config.yml
  5. Deploy the auth serverless infrastructure
# Run in /serverless/auth
serverless deploy --configfile ../../../minerva-configs/test/config.yml
  6. Deploy the db serverless infrastructure
# Run in /serverless/db
serverless deploy --configfile ../../../minerva-configs/test/config.yml
  7. Deploy the batch serverless infrastructure
# Run in /serverless/batch
serverless deploy --configfile ../../../minerva-configs/test/config.yml
  8. Deploy the api serverless infrastructure
# Run in /serverless/api
serverless deploy --configfile ../../../minerva-configs/test/config.yml
  9. Deploy the author serverless infrastructure (OPTIONAL)
  • This is only needed when integrating Minerva Author with Minerva Cloud
# Run in /cloudformation
python cloudformation.py create author ../../minerva-configs/test/config.yml
# Run in /serverless/author
serverless deploy --configfile ../../../minerva-configs/test/config.yml
  10. Run the AWS Lambda initdb function to initialise the database (a boto3 sketch follows this list)
  • Find the function name (e.g. minerva-test-dev-initDb) in the AWS Lambda console
  • Open the function and click "Test"
  11. Create some users using the AWS Cognito console
  • New users are created automatically in the Minerva database by a Cognito trigger.
  • The password has to be updated on the first sign-in.
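
As mentioned in step 10, the initdb function can also be invoked from the command line instead of the console. This is a minimal sketch using boto3; the function name is only the example from step 10 and will differ per deployment.

# invoke_initdb.py -- sketch of invoking the initdb Lambda without the console.
# The function name is the example from step 10; check the Lambda console for yours.
import json
import boto3

lambda_client = boto3.client("lambda")

response = lambda_client.invoke(
    FunctionName="minerva-test-dev-initDb",  # example name, adjust to your STACK_PREFIX/STAGE
    InvocationType="RequestResponse",        # wait for the result, like clicking "Test"
    Payload=json.dumps({}).encode(),         # initdb needs no input event
)
print(response["StatusCode"], response["Payload"].read().decode())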

minerva-cloud's People

Contributors: adriana-pop, dpwrussell, juha-ruokonen, pagreene, tsepikov

minerva-cloud's Issues

Releases

  • Start doing releases of minerva-infrastructure and related projects
  • Use version specific dependency on minerva-db
  • Use version specific dependency on minerva-lib-python
  • Use version specific Docker Images in Batch Job definitions as per #11
  • Associated with an appropriate version of minerva-client-js

Implement reader software to Batch Job definition lookup

In order to support multiple readers and versions of readers, implement a mechanism of using the software and its version to determine which job definition (and thus which Docker image) is appropriate.

Could potentially be implemented as:

  • A Lambda function which reads a configuration file, potentially derived in part from the job definitions defined in the Batch CloudFormation deployment stage
  • SSM parameters (less flexible; a sketch follows below)
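
A minimal sketch of the SSM option, assuming one parameter per software/version pair; the parameter naming scheme and the fallback are assumptions, not an existing convention in this repository.

# Sketch of the SSM option: one parameter per (software, version) pair whose value is the
# name of the Batch job definition to use. The parameter naming scheme is an assumption.
import boto3

ssm = boto3.client("ssm")

def job_definition_for(software, version):
    name = "/minerva/job-definitions/{}/{}".format(software, version)
    try:
        return ssm.get_parameter(Name=name)["Parameter"]["Value"]
    except ssm.exceptions.ParameterNotFound:
        # Fall back to a default definition if this version has no dedicated entry.
        return ssm.get_parameter(Name="/minerva/job-definitions/default")["Parameter"]["Value"]

# e.g. job_definition_for("Bio-Formats", "6.0.1") might return "minerva-extract-bioformats-6-0-1"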

Batch job memory allocation override

It will be necessary to dynamically allocate the amount of memory that a batch job will require. The size of an image plane should be determined in the scan phase and this information passed forward to the extraction step function so that it can override the default 1024MB with something appropriate for larger images when launching the batch job.
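
A minimal sketch of such an override at submission time, assuming the plane size has already been estimated during the scan phase; the queue and job definition names and the sizing heuristic are placeholders.

# Sketch: override the default 1024 MB at submit time based on an estimated plane size.
# Queue/definition names and the sizing heuristic are placeholders.
import boto3

batch = boto3.client("batch")

def submit_extract_job(fileset_uuid, plane_bytes):
    # Crude heuristic: a few times the plane size, never below the 1024 MB default.
    memory_mb = max(1024, 4 * plane_bytes // (1024 * 1024))
    return batch.submit_job(
        jobName="extract-{}".format(fileset_uuid),
        jobQueue="minerva-batch-queue",    # placeholder
        jobDefinition="minerva-extract",   # placeholder
        containerOverrides={
            "resourceRequirements": [
                {"type": "MEMORY", "value": str(memory_mb)},
            ],
        },
    )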

Support for multiple OME-XML schema versions and upgrades

At the moment, only one version of the OME-XML schema is supported (http://www.openmicroscopy.org/Schemas/OME/2016-06). We should also support past and future versions, and/or upgrade the extracted metadata XML as the schema moves forward.

Soft delete

Implement a mechanism for soft delete in both the database and object store so that easy rollbacks of mistakenly deleted data are possible.
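
One possible shape for the database side, assuming SQLAlchemy-style models; the mixin and column name are illustrative rather than the project's actual schema. On the object store side, S3 object versioning (optionally combined with a lifecycle rule) can play a similar role.

# Sketch of soft delete on the database side, assuming SQLAlchemy-style models.
# The mixin and column name are illustrative, not the project's actual schema.
from datetime import datetime
from sqlalchemy import Column, DateTime

class SoftDeleteMixin:
    deleted_at = Column(DateTime, nullable=True)  # NULL means the row is live

    def soft_delete(self):
        self.deleted_at = datetime.utcnow()

# Reads exclude soft-deleted rows, and a rollback simply clears deleted_at:
#   session.query(Image).filter(Image.deleted_at.is_(None))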

Backup & Restore

  • Implement backup and restore procedure for user pool.
  • Implement backup (snapshots already in place) and restore procedure for database.
  • Implement protection on S3 in lieu of backups (handled by AWS)
  • Ensure referential integrity between data sources in the event of a failure
  • Ensure that in the event of a system wide failure, active jobs are completed successfully (See #26)

Handling failures

  • Dead letter queues for failures from all queue-based operations (step functions, batch jobs and potentially SQS); a sketch for the SQS case follows this list
  • Handle communication failures (e.g. a Docker container attempting to launch a step function)
  • Log everything
  • Report failures (e.g. a fileset extract that fails because of Bio-Formats needs to have that failure registered in the database)
  • Facility to activate retries from a given stage of processing
  • Periodically attempt to detect inconsistencies (e.g. unregistered objects in S3, leftover data on EFS, incomplete imports without a corresponding record of failure)
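
For the SQS case mentioned above, a dead letter queue is just a redrive policy on the source queue. A minimal sketch (queue names and maxReceiveCount are placeholders):

# Sketch: attach a dead letter queue to a work queue via a redrive policy.
# Queue names and maxReceiveCount are placeholders.
import json
import boto3

sqs = boto3.client("sqs")

dlq_url = sqs.create_queue(QueueName="minerva-extract-dlq")["QueueUrl"]
dlq_arn = sqs.get_queue_attributes(
    QueueUrl=dlq_url, AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

work_queue_url = sqs.create_queue(QueueName="minerva-extract-queue")["QueueUrl"]
sqs.set_queue_attributes(
    QueueUrl=work_queue_url,
    Attributes={
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": "5"}
        )
    },
)

Lambda functions and step functions have their own equivalents (a DeadLetterConfig on the function, and Catch states in the state machine, respectively).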

Cleanup EFS Staging Area

Clean up the EFS share as files are no longer needed:

  • Files already used for extraction, once the extraction completes
  • Any unrecognised files, once the scan completes

Either or both of these operations could be done with a teardown job definition or be tacked onto the duties of the scan and extract jobs.
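
A teardown step could be as small as removing the fileset's staging directory once its extraction succeeds; the /mnt/efs/<import_uuid> layout below is an assumption.

# Sketch of a teardown step: remove a fileset's staging directory after a successful
# extraction. The /mnt/efs/<import_uuid> layout is an assumption.
import os
import shutil

EFS_ROOT = "/mnt/efs"

def cleanup_staging(import_uuid):
    staging_dir = os.path.join(EFS_ROOT, import_uuid)
    if os.path.isdir(staging_dir):
        shutil.rmtree(staging_dir)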

Database upgrades

Devise a more automated mechanism of doing database schema upgrades.
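
One common approach, noted here only as a possibility rather than a decision, is Alembic migrations, which can also be run programmatically (e.g. from a maintenance Lambda) rather than by hand:

# Sketch: run Alembic migrations programmatically, e.g. from a maintenance Lambda.
# Assumes an alembic.ini and a versions/ directory are packaged alongside the code.
from alembic import command
from alembic.config import Config

def upgrade_database():
    config = Config("alembic.ini")
    command.upgrade(config, "head")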

Remove hardcoded references

Pass some of these from the configuration file or retrieve from SSM, etc.

Current list

  • serverless, db, service
  • serverless, db, Default VPC for SSM
  • serverless, db, subnetIds
  • serverless, db, deploymentBucket
  • serverless, db, STACK_PREFIX
  • serverless, db, STAGE
  • serverless, batch, service
  • serverless, batch, Default VPC for SSM
  • serverless, batch, subnetIds
  • serverless, batch, deploymentBucket
  • serverless, batch, STACK_PREFIX
  • serverless, batch, STAGE
  • serverless, api, service
  • serverless, api, Default VPC for SSM
  • serverless, api, subnetIds
  • serverless, api, deploymentBucket
  • serverless, api, STACK_PREFIX
  • serverless, api, STAGE
  • serverless, api, restApiId
  • serverless, api, restApiRootResourceId
  • serverless, api, /image/{uuid}

Cognito user registration and hooks

  • New users registered through the admin interface are automatically registered in the application database.
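
This corresponds to a Cognito post confirmation trigger; a minimal sketch of such a handler is below (register_user is a placeholder, not the actual minerva-db API).

# Sketch of a Cognito post confirmation trigger that mirrors a new user into the
# application database. register_user is a placeholder, not the actual minerva-db API.
def register_user(cognito_sub, email):
    # Placeholder for the real database insert.
    pass

def handler(event, context):
    attributes = event["request"]["userAttributes"]
    register_user(attributes["sub"], attributes.get("email"))

    # Cognito triggers must return the event to complete the sign-up flow.
    return event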

Other useful hooks might be:

  • Merge one user account into another (delete one cognito account and combine data in database)
  • Delete/Disable user (delete/disable in cognito, probably do nothing in database)

Manual signup:

  • Based upon deployment configuration, allow users to self register.

Batch job submissions timing out

Submissions of batch jobs have been exceeding the 6-second default Lambda timeout.

[INFO]	2018-09-05T01:17:24.391Z	Found credentials in environment variables.
[INFO]	2018-09-05T01:17:24.638Z	Starting new HTTPS connection (1): ssm.us-east-1.amazonaws.com
START RequestId: e50b2988-6e16-494c-af8e-6e50355d493d Version: $LATEST
Received event: {
"import_uuid": "848e5eea-35e9-412d-adec-a0c023579e96",
"files": [
"ashlar_examples/BP40.ome.tif"
],
"reader": "loci.formats.in.OMETiffReader",
"reader_software": "Bio-Formats",
"reader_version": "(unknown version)",
"fileset_uuid": "e7a6dbc8-a457-4d28-8b3b-65beed085716"
}
Parameters:{
"dir": "848e5eea-35e9-412d-adec-a0c023579e96",
"file": "ashlar_examples/BP40.ome.tif",
"reader": "loci.formats.in.OMETiffReader",
"reader_software": "Bio-Formats",
"reader_version": "(unknown version)",
"fileset_uuid": "e7a6dbc8-a457-4d28-8b3b-65beed085716",
"bucket": "minerva-test-cf-common-tilebucket-1su418jflefem"
}
[INFO]	2018-09-05T01:17:24.875Z	e50b2988-6e16-494c-af8e-6e50355d493d	Starting new HTTPS connection (1): batch.us-east-1.amazonaws.com
END RequestId: e50b2988-6e16-494c-af8e-6e50355d493d
REPORT RequestId: e50b2988-6e16-494c-af8e-6e50355d493d	Duration: 6006.21 ms	Billed Duration: 6000 ms Memory Size: 1024 MB	Max Memory Used: 79 MB	
2018-09-05T01:17:30.840Z e50b2988-6e16-494c-af8e-6e50355d493d Task timed out after 6.01 seconds

[INFO]	2018-09-05T01:17:31.967Z	Found credentials in environment variables.
[INFO]	2018-09-05T01:17:32.37Z	Starting new HTTPS connection (1): ssm.us-east-1.amazonaws.com

There should be no reason for this to time out; it looks like a potential AWS issue. If it turns out to be more than a one-off, we can work around it by increasing the Lambda timeout or with retry logic in the step function (a sketch follows).
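
If it does recur, a Retry block on the submission task in the state machine definition would cover it. A sketch of the relevant state in Amazon States Language (state name and resource ARN are placeholders), expressed as the Python dict that would be serialised into the definition:

# Sketch: retry the batch-submission task on timeouts. The state name and resource ARN
# are placeholders; this dict would be serialised into the state machine definition.
submit_batch_job_state = {
    "Type": "Task",
    "Resource": "arn:aws:lambda:us-east-1:111111111111:function:submitBatchJob",  # placeholder
    "Retry": [
        {
            "ErrorEquals": ["States.Timeout", "Lambda.ServiceException"],
            "IntervalSeconds": 5,
            "MaxAttempts": 3,
            "BackoffRate": 2.0,
        }
    ],
    "Next": "WaitForJob",  # placeholder next state
}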

Semantics of reprocessing data

There are several use-cases that warrant reprocessing of data:

  • Failure during the scan stage to identify a fileset, which might be fixed in a new version of the scanner.
  • Failure during the extract stage to successfully extract a fileset, which might be fixed in a new version of the extractor.
  • Failure during the scan/extract stage due to an unpredicted server-side error that has since been resolved.
  • Even if an extract phase is successfully completed, the extracted metadata or images might be less than optimal and benefit from reprocessing the fileset.

The exact semantics of this need to be defined before settling on an implementation strategy.

Questions:

  • Is the original import entirely replaced by the reprocessed one?
  • Is the original fileset entirely replaced by the reprocessed one?
  • If reprocessed imports/filesets do not replace the originals, what happens to the originals and how do we record this in the database?

Docker image versioning and their use within Batch Job definitions

  • Start depending on specific versions of Docker images instead of latest
  • Handle moving from one docker image to another for a job definition.
  • Provide job definitions per docker version (or other changes)

Changing a job definition to a different Docker image with CloudFormation requires replacement, i.e. the definition must be removed and a new one added. It is also desirable not to remove old job definitions that are still needed to process the existing queue, or that are needed to explicitly make use of an older version of Bio-Formats contained in a specific job definition version.

Thus the solution is to add further job definitions for each new version of a Docker image that is supported. Old job definitions that are no longer useful can be removed once it has been confirmed that they are not actively in use.

This should be handled with reference to whatever mechanism is eventually used to associate scan/extraction software tools and versions with the Docker images that contain them.

This will need to populate the lookup mechanism described in #24

Potentially a separate configuration file for software and versions is required. This could be used at runtime to do lookups, and also to drive the Cloudformation deployment. I.e. for each entry a job definition is defined and deployed.
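
A sketch of what that configuration-driven shape could look like, assuming a simple YAML mapping of reader software and versions to Docker image tags; the file name and structure are assumptions.

# Sketch: one software/versions file drives both runtime lookups and the job definitions
# generated at deploy time. The file name and structure are assumptions, e.g. readers.yml:
#
#   Bio-Formats:
#     "6.0.1": "minerva/bioformats-extract:6.0.1"
#     "5.9.2": "minerva/bioformats-extract:5.9.2"
import yaml

def job_definitions(path="readers.yml"):
    with open(path) as f:
        readers = yaml.safe_load(f)
    for software, versions in readers.items():
        for version, image in versions.items():
            name = "extract-{}-{}".format(software.lower(), version.replace(".", "-"))
            yield {"name": name, "image": image}

# At deploy time each entry becomes a Batch job definition in the CloudFormation template;
# at runtime the same file answers "which job definition handles Bio-Formats 6.0.1?".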

Import (and any other batch operations) tracking

Add to the API the ability to get information about the status of an import (or other batch operation).

  • Sync phase status
  • Scan phase status
  • Filesets extracted
  • Filesets yet to be extracted
  • Failed fileset extractions
  • Overall status

It will probably be necessary to add some tracking information to the database (or another, more temporary data store for operational data) so that the AWS APIs that provide the necessary information can be queried, e.g. by recording the execution ARN of each step function.
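
With execution ARNs recorded, per-fileset and overall status can be derived directly from the Step Functions API; a minimal sketch:

# Sketch: derive an overall import status from recorded step function execution ARNs.
# The ARNs are assumed to have been stored per fileset when the executions were started.
import boto3

sfn = boto3.client("stepfunctions")

def import_status(execution_arns):
    statuses = [
        sfn.describe_execution(executionArn=arn)["status"]  # RUNNING, SUCCEEDED, FAILED, ...
        for arn in execution_arns
    ]
    if any(status == "FAILED" for status in statuses):
        return "FAILED"
    if statuses and all(status == "SUCCEEDED" for status in statuses):
        return "SUCCEEDED"
    return "RUNNING"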

Orchestrate Batch jobs without onward call

It is desirable that none of the Docker images require knowledge of Minerva and AWS as then they can be completely generic and run standalone without modification. However, this may be an overly purist approach.

A moderate approach where the images have local/AWS modes of operation might make sense. The AWS mode of operation would have enhanced capabilities such as writing outputs to S3.

The major question is how to deal with orchestrating the steps in the batch import pipeline. If the scan phase identifies a fileset to process, how should we initiate the extraction phase that follows? Options are:

  • Write them to a file and, upon completion of the scan, process the file in a Lambda and launch many jobs. Very clean and easily composable into different workflows, but increases overall import latency.
  • Launch the step function for extract directly. Low latency, but harder to compose into different workflows, and it requires the Docker image to depend directly on the interfaces to the next steps in the pipeline.
  • Add items to an SQS queue. More complex than launching the step function directly, but SQS is made specifically for this type of operation. Again, it is harder to compose into different workflows and requires the Docker image to depend directly on the interfaces to the next steps in the pipeline.
  • Some kind of opportunistic hybrid approach?

Writing the payloads between pipeline steps to S3 may be the best solution (a sketch follows the list) because:

  • It allows more sophisticated configuration of a job than command line parameters and environment variables alone, which quickly become awkward
  • Payload size limits for SQS, Step Functions and Lambda are quite low
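
A minimal sketch of the S3 hand-off: the scan job writes its results as an object, and a small Lambda reads them back and starts one extract execution per fileset. The bucket, key layout and state machine name are placeholders.

# Sketch of the S3 hand-off between scan and extract. Bucket, key layout and state
# machine name are placeholders.
import json
import boto3

s3 = boto3.client("s3")
sfn = boto3.client("stepfunctions")

def write_scan_results(bucket, import_uuid, filesets):
    # Runs inside the (otherwise generic) scan container when in AWS mode.
    s3.put_object(
        Bucket=bucket,
        Key="scan-results/{}.json".format(import_uuid),
        Body=json.dumps({"filesets": filesets}).encode(),
    )

def launch_extractions(bucket, import_uuid, state_machine_arn):
    # Runs in a Lambda once the scan job completes.
    body = s3.get_object(
        Bucket=bucket, Key="scan-results/{}.json".format(import_uuid)
    )["Body"].read()
    for fileset in json.loads(body)["filesets"]:
        sfn.start_execution(stateMachineArn=state_machine_arn, input=json.dumps(fileset))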

Handle transitions between BatchAMI versions

AWS Batch currently has no mechanism to add custom initialisation when provisioning instances. This feature does exist in ECS, so it seems likely it will eventually be added to Batch as well. Once it is, the custom AMI can be dispensed with entirely.

Until that time, the custom AMI (which is preconfigured for the specific EFS volume) is required. To upgrade from one AMI to another it will be necessary to delete and create new EC2/spot compute environments. This can't be done while jobs are active. A logical approach would be to add the new compute environments, switch over to them, and remove the old ones once they have drained.

Note: To avoid building an AMI for each Minerva deployment it would have been nice to be able to use environment variables to inform each instance what EFS volume to mount, but unfortunately there is no mechanism to supply the Batch instances with environment variables either.

Optimise EFS Synchronisation

It might be possible to optimise the EFS synchronisation step by only syncing data that could possibly be recognised by the software used in the scan and subsequent extraction steps.
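
For example, the sync could skip any key whose extension no configured reader will recognise; a sketch (the extension list is illustrative and would in practice come from the reader configuration):

# Sketch: list only the keys whose extensions a configured reader could recognise,
# and sync just those. The extension list is illustrative.
import boto3

RECOGNISED_EXTENSIONS = (".ome.tif", ".ome.tiff", ".tif", ".tiff", ".czi", ".nd2")

s3 = boto3.client("s3")

def keys_to_sync(bucket, prefix):
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["Key"].lower().endswith(RECOGNISED_EXTENSIONS):
                yield obj["Key"]

Care would be needed for formats whose readers depend on companion files with other extensions.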
