labsyspharm / minerva-cloud

Minerva Cloud is a novel cloud-native (AWS) platform for high-dimensional microscopy image storage, management, and visualization.

License: MIT License

Python 99.94% Dockerfile 0.06%
minerva aws cloud-computing microscopy ome immunofluorescence

minerva-cloud's Introduction

Code style: black

Minerva Cloud - AWS backend infrastructure

This repository contains the templates necessary to deploy the Minerva Cloud platform in AWS: CloudFormation templates for creating the core AWS infrastructure (S3 buckets, database, Cognito user pool, etc.) and Serverless Framework configurations for the various serverless applications.

API Documentation

Minerva API

Prerequisites

These need to be created manually in AWS console or with the AWS CLI:

  • A VPC in the desired AWS region.
  • A pair of public subnets in the VPC.
  • A pair of private subnets with NAT gateways configured in the VPC.
  • A default security group which allows communication in/out from itself.
  • A security group which allows SSH communication to EC2 instances as required.
  • A YAML configuration file listing these resources and some other properties.
  • A deployment bucket for Serverless Framework.

Black

The code is formatted using black. Formatting was applied in a single commit, so for the most useful git blame output we suggest you run

git config blame.ignoreRevsFile .git-blame-ignore-revs

See Black docs for more information.

AWS Profile

If you need to use an AWS profile other than the default one to access AWS resources, set it with:

  • export AWS_PROFILE=profile_name

Configuration File

An example configuration file, minerva-config.example.yml, is included in the repository. You need to update the VPC, subnets and other values in the configuration file to match your environment.
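
As a rough sanity check before deploying, the configuration can be loaded and validated with a short script. This is only a sketch: the key names below are assumptions for illustration; consult minerva-config.example.yml for the keys your checkout actually uses.

# check_config.py -- minimal sketch; the required key names are assumptions,
# check minerva-config.example.yml for the keys your checkout actually uses.
import sys
import yaml  # requires PyYAML

ASSUMED_KEYS = [
    "StackPrefix", "Stage", "Region",            # naming and stage settings (assumed)
    "VpcId", "SubnetsPublic", "SubnetsPrivate",  # network prerequisites (assumed)
    "DefaultSecurityGroup", "BatchAMI",          # security group and Batch AMI ID (assumed)
]

def main(path):
    with open(path) as f:
        config = yaml.safe_load(f) or {}
    missing = [key for key in ASSUMED_KEYS if key not in config]
    if missing:
        sys.exit("Missing configuration keys: " + ", ".join(missing))
    print("Configuration contains all expected keys.")

if __name__ == "__main__":
    main(sys.argv[1])

For example: python check_config.py ../../minerva-configs/test/config.yml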

Instructions

The stacks can later be updated by replacing the word "create" with "update". The instructions below assume the configuration file is in a folder named minerva-configs, which is a sibling of the minerva-cloud project root directory.

Before deploying the various serverless applications, install the required Node packages by running the following in each serverless/* directory:

npm install

  1. Deploy the common CloudFormation infrastructure
# Run in /cloudformation
python cloudformation.py create common ../../minerva-configs/test/config.yml
  2. Deploy the Cognito CloudFormation infrastructure
# Run in /cloudformation
python cloudformation.py create cognito ../../minerva-configs/test/config.yml
  3. Build the Batch AMI (Amazon Machine Image)
# Run in /ami-builder
python build.py ../../minerva-configs/test/config.yml

After the image has been created, the Batch AMI ID must be added to config.yml.

  4. Deploy the Batch CloudFormation infrastructure
# Run in /cloudformation
python cloudformation.py create batch ../../minerva-configs/test/config.yml
  5. Deploy the auth serverless infrastructure
# Run in /serverless/auth
serverless deploy --configfile ../../../minerva-configs/test/config.yml
  6. Deploy the db serverless infrastructure
# Run in /serverless/db
serverless deploy --configfile ../../../minerva-configs/test/config.yml
  7. Deploy the batch serverless infrastructure
# Run in /serverless/batch
serverless deploy --configfile ../../../minerva-configs/test/config.yml
  8. Deploy the api serverless infrastructure
# Run in /serverless/api
serverless deploy --configfile ../../../minerva-configs/test/config.yml
  9. Deploy the author serverless infrastructure (OPTIONAL)
  • This is only needed when integrating Minerva Author with Minerva Cloud
# Run in /cloudformation
python cloudformation.py create author ../../minerva-configs/test/config.yml
# Run in /serverless/author
serverless deploy --configfile ../../../minerva-configs/test/config.yml
  10. Run the AWS Lambda initdb function to initialise the database (a boto3 sketch follows this list)
  • Find the function name (e.g. minerva-test-dev-initDb) in the AWS Lambda console
  • Open the function and click "Test"
  11. Create some users using the AWS Cognito console
  • New users are created automatically in the Minerva database by a Cognito trigger.
  • The password has to be updated on the first sign-in.
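
As mentioned in step 10, the initdb function can also be invoked from the command line instead of the console. This is a minimal sketch using boto3; the function name is only the example from step 10 and will differ per deployment.

# invoke_initdb.py -- sketch of invoking the initdb Lambda without the console.
# The function name is the example from step 10; check the Lambda console for yours.
import json
import boto3

lambda_client = boto3.client("lambda")

response = lambda_client.invoke(
    FunctionName="minerva-test-dev-initDb",  # example name, adjust to your STACK_PREFIX/STAGE
    InvocationType="RequestResponse",        # wait for the result, like clicking "Test"
    Payload=json.dumps({}).encode(),         # initdb needs no input event
)
print(response["StatusCode"], response["Payload"].read().decode())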

minerva-cloud's People

Contributors: adriana-pop, dpwrussell, juha-ruokonen, pagreene, tsepikov

minerva-cloud's Issues

Releases

  • Start doing releases of minerva-infrastructure and related projects
  • Use version specific dependency on minerva-db
  • Use version specific dependency on minerva-lib-python
  • Use version specific Docker Images in Batch Job definitions as per #11
  • Associated with an appropriate version of minerva-client-js

Implement reader software to Batch Job definition lookup

In order to support multiple readers and versions of readers, implement a mechanism of using the software and its version to determine which job definition (and thus which Docker image) is appropriate.

Could potentially be implemented as:

  • A Lambda function which reads a configuration file, potentially derived in part from the job definitions defined in the Batch CloudFormation deployment stage
  • SSM parameters (less flexible; a sketch follows below)
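
A minimal sketch of the SSM option, assuming one parameter per software/version pair; the parameter naming scheme and the fallback are assumptions, not an existing convention in this repository.

# Sketch of the SSM option: one parameter per (software, version) pair whose value is the
# name of the Batch job definition to use. The parameter naming scheme is an assumption.
import boto3

ssm = boto3.client("ssm")

def job_definition_for(software, version):
    name = "/minerva/job-definitions/{}/{}".format(software, version)
    try:
        return ssm.get_parameter(Name=name)["Parameter"]["Value"]
    except ssm.exceptions.ParameterNotFound:
        # Fall back to a default definition if this version has no dedicated entry.
        return ssm.get_parameter(Name="/minerva/job-definitions/default")["Parameter"]["Value"]

# e.g. job_definition_for("Bio-Formats", "6.0.1") might return "minerva-extract-bioformats-6-0-1"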

Batch job memory allocation override

It will be necessary to dynamically allocate the amount of memory that a batch job will require. The size of an image plane should be determined in the scan phase and this information passed forward to the extraction step function so that it can override the default 1024MB with something appropriate for larger images when launching the batch job.
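
A minimal sketch of such an override at submission time, assuming the plane size has already been estimated during the scan phase; the queue and job definition names and the sizing heuristic are placeholders.

# Sketch: override the default 1024 MB at submit time based on an estimated plane size.
# Queue/definition names and the sizing heuristic are placeholders.
import boto3

batch = boto3.client("batch")

def submit_extract_job(fileset_uuid, plane_bytes):
    # Crude heuristic: a few times the plane size, never below the 1024 MB default.
    memory_mb = max(1024, 4 * plane_bytes // (1024 * 1024))
    return batch.submit_job(
        jobName="extract-{}".format(fileset_uuid),
        jobQueue="minerva-batch-queue",    # placeholder
        jobDefinition="minerva-extract",   # placeholder
        containerOverrides={
            "resourceRequirements": [
                {"type": "MEMORY", "value": str(memory_mb)},
            ],
        },
    )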

Support for multiple OME-XML schema versions and upgrades

At the moment, only one version of the OME-XML schema is supported (http://www.openmicroscopy.org/Schemas/OME/2016-06). We should also support past and future versions, and/or upgrade the extracted metadata XML as the schema moves forward.

Soft delete

Implement a mechanism for soft delete in both the database and object store so that easy rollbacks of mistakenly deleted data are possible.
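
One possible shape for the database side, assuming SQLAlchemy-style models; the mixin and column name are illustrative rather than the project's actual schema. On the object store side, S3 object versioning (optionally combined with a lifecycle rule) can play a similar role.

# Sketch of soft delete on the database side, assuming SQLAlchemy-style models.
# The mixin and column name are illustrative, not the project's actual schema.
from datetime import datetime
from sqlalchemy import Column, DateTime

class SoftDeleteMixin:
    deleted_at = Column(DateTime, nullable=True)  # NULL means the row is live

    def soft_delete(self):
        self.deleted_at = datetime.utcnow()

# Reads exclude soft-deleted rows, and a rollback simply clears deleted_at:
#   session.query(Image).filter(Image.deleted_at.is_(None))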

Backup & Restore

  • Implement backup and restore procedure for user pool.
  • Implement backup (snapshots already in place) and restore procedure for database.
  • Implement protection on S3 in lieu of backups (handled by AWS)
  • Ensure referential integrity between data sources in the event of a failure
  • Ensure that in the event of a system wide failure, active jobs are completed successfully (See #26)

Handling failures

  • Dead letter queues for failures from all queue-based operations (step functions, batch jobs and potentially SQS); a sketch for the SQS case follows this list
  • Handle communication failures (e.g. a Docker container attempting to launch a step function)
  • Log everything
  • Report failures (e.g. a fileset extract that fails because of Bio-Formats needs to have that failure registered in the database)
  • Facility to activate retries from a given stage of processing
  • Periodically attempt to detect inconsistencies (e.g. unregistered objects in S3, leftover data on EFS, incomplete imports without a corresponding record of failure)
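
For the SQS case mentioned above, a dead letter queue is just a redrive policy on the source queue. A minimal sketch (queue names and maxReceiveCount are placeholders):

# Sketch: attach a dead letter queue to a work queue via a redrive policy.
# Queue names and maxReceiveCount are placeholders.
import json
import boto3

sqs = boto3.client("sqs")

dlq_url = sqs.create_queue(QueueName="minerva-extract-dlq")["QueueUrl"]
dlq_arn = sqs.get_queue_attributes(
    QueueUrl=dlq_url, AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

work_queue_url = sqs.create_queue(QueueName="minerva-extract-queue")["QueueUrl"]
sqs.set_queue_attributes(
    QueueUrl=work_queue_url,
    Attributes={
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": "5"}
        )
    },
)

Lambda functions and step functions have their own equivalents (a DeadLetterConfig on the function, and Catch states in the state machine, respectively).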

Cleanup EFS Staging Area

Clean up the EFS share as files are no longer needed:

  • Files already used for extraction, once the extraction completes
  • Any unrecognised files, once the scan completes

Either or both of these operations could be done with a teardown job definition or be tacked onto the duties of the scan and extract jobs.
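
A teardown step could be as small as removing the fileset's staging directory once its extraction succeeds; the /mnt/efs/<import_uuid> layout below is an assumption.

# Sketch of a teardown step: remove a fileset's staging directory after a successful
# extraction. The /mnt/efs/<import_uuid> layout is an assumption.
import os
import shutil

EFS_ROOT = "/mnt/efs"

def cleanup_staging(import_uuid):
    staging_dir = os.path.join(EFS_ROOT, import_uuid)
    if os.path.isdir(staging_dir):
        shutil.rmtree(staging_dir)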

Database upgrades

Devise a more automated mechanism of doing database schema upgrades.
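
One common approach, noted here only as a possibility rather than a decision, is Alembic migrations, which can also be run programmatically (e.g. from a maintenance Lambda) rather than by hand:

# Sketch: run Alembic migrations programmatically, e.g. from a maintenance Lambda.
# Assumes an alembic.ini and a versions/ directory are packaged alongside the code.
from alembic import command
from alembic.config import Config

def upgrade_database():
    config = Config("alembic.ini")
    command.upgrade(config, "head")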

Remove hardcoded references

Pass some of these from the configuration file or retrieve from SSM, etc.

Current list

  • serverless, db, service
  • serverless, db, Default VPC for SSM
  • serverless, db, subnetIds
  • serverless, db, deploymentBucket
  • serverless, db, STACK_PREFIX
  • serverless, db, STAGE
  • serverless, batch, service
  • serverless, batch, Default VPC for SSM
  • serverless, batch, subnetIds
  • serverless, batch, deploymentBucket
  • serverless, batch, STACK_PREFIX
  • serverless, batch, STAGE
  • serverless, api, service
  • serverless, api, Default VPC for SSM
  • serverless, api, subnetIds
  • serverless, api, deploymentBucket
  • serverless, api, STACK_PREFIX
  • serverless, api, STAGE
  • serverless, api, restApiId
  • serverless, api, restApiRootResourceId
  • serverless, api, /image/{uuid}

Cognito user registration and hooks

  • New users registered through the admin interface are automatically registered in the application database.
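
This corresponds to a Cognito post confirmation trigger; a minimal sketch of such a handler is below (register_user is a placeholder, not the actual minerva-db API).

# Sketch of a Cognito post confirmation trigger that mirrors a new user into the
# application database. register_user is a placeholder, not the actual minerva-db API.
def register_user(cognito_sub, email):
    # Placeholder for the real database insert.
    pass

def handler(event, context):
    attributes = event["request"]["userAttributes"]
    register_user(attributes["sub"], attributes.get("email"))

    # Cognito triggers must return the event to complete the sign-up flow.
    return event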

Other useful hooks might be:

  • Merge one user account into another (delete one cognito account and combine data in database)
  • Delete/Disable user (delete/disable in cognito, probably do nothing in database)

Manual signup:

  • Based upon deployment configuration, allow users to self register.

Batch job submissions timing out

Submissions of batch jobs have been exceeding the 6-second default Lambda timeout.

[INFO]	2018-09-05T01:17:24.391Z	Found credentials in environment variables.
[INFO]	2018-09-05T01:17:24.638Z	Starting new HTTPS connection (1): ssm.us-east-1.amazonaws.com
START RequestId: e50b2988-6e16-494c-af8e-6e50355d493d Version: $LATEST
Received event: {
"import_uuid": "848e5eea-35e9-412d-adec-a0c023579e96",
"files": [
"ashlar_examples/BP40.ome.tif"
],
"reader": "loci.formats.in.OMETiffReader",
"reader_software": "Bio-Formats",
"reader_version": "(unknown version)",
"fileset_uuid": "e7a6dbc8-a457-4d28-8b3b-65beed085716"
}
Parameters:{
"dir": "848e5eea-35e9-412d-adec-a0c023579e96",
"file": "ashlar_examples/BP40.ome.tif",
"reader": "loci.formats.in.OMETiffReader",
"reader_software": "Bio-Formats",
"reader_version": "(unknown version)",
"fileset_uuid": "e7a6dbc8-a457-4d28-8b3b-65beed085716",
"bucket": "minerva-test-cf-common-tilebucket-1su418jflefem"
}
[INFO]	2018-09-05T01:17:24.875Z	e50b2988-6e16-494c-af8e-6e50355d493d	Starting new HTTPS connection (1): batch.us-east-1.amazonaws.com
END RequestId: e50b2988-6e16-494c-af8e-6e50355d493d
REPORT RequestId: e50b2988-6e16-494c-af8e-6e50355d493d	Duration: 6006.21 ms	Billed Duration: 6000 ms Memory Size: 1024 MB	Max Memory Used: 79 MB	
2018-09-05T01:17:30.840Z e50b2988-6e16-494c-af8e-6e50355d493d Task timed out after 6.01 seconds

[INFO]	2018-09-05T01:17:31.967Z	Found credentials in environment variables.
[INFO]	2018-09-05T01:17:32.37Z	Starting new HTTPS connection (1): ssm.us-east-1.amazonaws.com

There should be no reason for this to time out; it looks like a potential AWS issue. If it turns out to be more than a one-off, we can work around it by increasing the Lambda timeout or with retry logic in the step function (a sketch follows).
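
If it does recur, a Retry block on the submission task in the state machine definition would cover it. A sketch of the relevant state in Amazon States Language (state name and resource ARN are placeholders), expressed as the Python dict that would be serialised into the definition:

# Sketch: retry the batch-submission task on timeouts. The state name and resource ARN
# are placeholders; this dict would be serialised into the state machine definition.
submit_batch_job_state = {
    "Type": "Task",
    "Resource": "arn:aws:lambda:us-east-1:111111111111:function:submitBatchJob",  # placeholder
    "Retry": [
        {
            "ErrorEquals": ["States.Timeout", "Lambda.ServiceException"],
            "IntervalSeconds": 5,
            "MaxAttempts": 3,
            "BackoffRate": 2.0,
        }
    ],
    "Next": "WaitForJob",  # placeholder next state
}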

Semantics of reprocessing data

There are several use-cases that warrant reprocessing of data:

  • Failure during the scan stage to identify a fileset, which might be fixed in a new version of the scanner.
  • Failure during the extract stage to successfully extract a fileset, which might be fixed in a new version of the extractor.
  • Failure during the scan/extract stage due to an unpredicted server-side error that has since been resolved.
  • Even if an extract phase is successfully completed, the extracted metadata or images might be less than optimal and benefit from reprocessing the fileset.

The exact semantics of this need to be defined before settling on an implementation strategy.

Questions:

  • Is the original import entirely replaced by the reprocessed one?
  • Is the original fileset entirely replaced by the reprocessed one?
  • If reprocessed imports/filesets do not replace the originals, what happens to the originals and how do we record this in the database?

Docker image versioning and their use within Batch Job definitions

  • Start depending on specific versions of Docker images instead of latest
  • Handle moving from one docker image to another for a job definition.
  • Provide job definitions per docker version (or other changes)

Changing a job definition to a different Docker image with CloudFormation requires replacement, i.e. the definition must be removed and a new one added. It is also desirable not to remove old job definitions that are still needed to process the existing queue, or that are needed to explicitly make use of an older version of Bio-Formats contained in a specific job definition version.

Thus the solution is to add further job definitions for each new version of a Docker image that is supported. Old job definitions that are no longer useful can be removed once it has been confirmed that they are not actively in use.

This should be handled with reference to whatever mechanism is eventually used to associate scan/extraction software tools and versions with the Docker images that contain them.

This will need to populate the lookup mechanism described in #24

Potentially a separate configuration file for software and versions is required. This could be used at runtime to do lookups, and also to drive the Cloudformation deployment. I.e. for each entry a job definition is defined and deployed.
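
A sketch of what that configuration-driven shape could look like, assuming a simple YAML mapping of reader software and versions to Docker image tags; the file name and structure are assumptions.

# Sketch: one software/versions file drives both runtime lookups and the job definitions
# generated at deploy time. The file name and structure are assumptions, e.g. readers.yml:
#
#   Bio-Formats:
#     "6.0.1": "minerva/bioformats-extract:6.0.1"
#     "5.9.2": "minerva/bioformats-extract:5.9.2"
import yaml

def job_definitions(path="readers.yml"):
    with open(path) as f:
        readers = yaml.safe_load(f)
    for software, versions in readers.items():
        for version, image in versions.items():
            name = "extract-{}-{}".format(software.lower(), version.replace(".", "-"))
            yield {"name": name, "image": image}

# At deploy time each entry becomes a Batch job definition in the CloudFormation template;
# at runtime the same file answers "which job definition handles Bio-Formats 6.0.1?".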

Import (and any other batch operations) tracking

Add to the API the ability to get information about the status of an import (or other batch operation).

  • Sync phase status
  • Scan phase status
  • Filesets extracted
  • Filesets yet to be extracted
  • Failed fileset extractions
  • Overall status

It will probably be necessary to add some tracking information to the database (or another, more temporary data store for operational data) so that the AWS APIs that provide the necessary information can be queried, e.g. by recording the execution ARN of each step function.
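
With execution ARNs recorded, per-fileset and overall status can be derived directly from the Step Functions API; a minimal sketch:

# Sketch: derive an overall import status from recorded step function execution ARNs.
# The ARNs are assumed to have been stored per fileset when the executions were started.
import boto3

sfn = boto3.client("stepfunctions")

def import_status(execution_arns):
    statuses = [
        sfn.describe_execution(executionArn=arn)["status"]  # RUNNING, SUCCEEDED, FAILED, ...
        for arn in execution_arns
    ]
    if any(status == "FAILED" for status in statuses):
        return "FAILED"
    if statuses and all(status == "SUCCEEDED" for status in statuses):
        return "SUCCEEDED"
    return "RUNNING"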

Orchestrate Batch jobs without onward call

It is desirable that none of the Docker images require knowledge of Minerva and AWS as then they can be completely generic and run standalone without modification. However, this may be an overly purist approach.

A moderate approach where the images have local/AWS modes of operation might make sense. The AWS mode of operation would have enhanced capabilities such as writing outputs to S3.

The major question is how to deal with orchestrating the steps in the batch import pipeline. If the scan phase identifies a fileset to process, how should we initiate the extraction phase that follows? Options are:

  • Write them to a file and, upon completion of the scan, process the file in a Lambda and launch many jobs. Very clean and easily composable into different workflows, but increases overall import latency.
  • Launch the step function for extract directly. Low latency, but harder to compose into different workflows, and it requires the Docker image to depend directly on the interfaces to the next steps in the pipeline.
  • Add items to an SQS queue. More complex than launching the step function directly, but SQS is made specifically for this type of operation. Again, it is harder to compose into different workflows and requires the Docker image to depend directly on the interfaces to the next steps in the pipeline.
  • Some kind of opportunistic hybrid approach?

Writing the payloads between pipeline steps to S3 may be the best solution (a sketch follows the list) because:

  • It allows more sophisticated configuration of a job than command line parameters and environment variables alone, which quickly become awkward
  • Payload size limits for SQS, Step Functions and Lambda are quite low
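
A minimal sketch of the S3 hand-off: the scan job writes its results as an object, and a small Lambda reads them back and starts one extract execution per fileset. The bucket, key layout and state machine name are placeholders.

# Sketch of the S3 hand-off between scan and extract. Bucket, key layout and state
# machine name are placeholders.
import json
import boto3

s3 = boto3.client("s3")
sfn = boto3.client("stepfunctions")

def write_scan_results(bucket, import_uuid, filesets):
    # Runs inside the (otherwise generic) scan container when in AWS mode.
    s3.put_object(
        Bucket=bucket,
        Key="scan-results/{}.json".format(import_uuid),
        Body=json.dumps({"filesets": filesets}).encode(),
    )

def launch_extractions(bucket, import_uuid, state_machine_arn):
    # Runs in a Lambda once the scan job completes.
    body = s3.get_object(
        Bucket=bucket, Key="scan-results/{}.json".format(import_uuid)
    )["Body"].read()
    for fileset in json.loads(body)["filesets"]:
        sfn.start_execution(stateMachineArn=state_machine_arn, input=json.dumps(fileset))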

Handle transitions between BatchAMI versions

AWS Batch currently has no mechanism to add custom initialisation when provisioning instances. This feature does exist in ECS, so it seems likely it will eventually be added to Batch as well. Once it is, the custom AMI can be dispensed with entirely.

Until that time, the custom AMI (which is preconfigured for the specific EFS volume) is required. To upgrade from one AMI to another it will be necessary to delete and create new EC2/spot compute environments. This can't be done while jobs are active. A logical approach would be to add the new compute environments, switch over to them, and remove the old ones once they have drained.

Note: To avoid building an AMI for each Minerva deployment it would have been nice to be able to use environment variables to inform each instance what EFS volume to mount, but unfortunately there is no mechanism to supply the Batch instances with environment variables either.

Optimise EFS Synchronisation

It might be possible to optimise the EFS synchronisation step by only syncing data that could possibly be recognised by the software used in the scan and subsequent extraction steps.
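
For example, the sync could skip any key whose extension no configured reader will recognise; a sketch (the extension list is illustrative and would in practice come from the reader configuration):

# Sketch: list only the keys whose extensions a configured reader could recognise,
# and sync just those. The extension list is illustrative.
import boto3

RECOGNISED_EXTENSIONS = (".ome.tif", ".ome.tiff", ".tif", ".tiff", ".czi", ".nd2")

s3 = boto3.client("s3")

def keys_to_sync(bucket, prefix):
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["Key"].lower().endswith(RECOGNISED_EXTENSIONS):
                yield obj["Key"]

Care would be needed for formats whose readers depend on companion files with other extensions.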
