
workflow-execution-service-schemas's Introduction

Workflow Execution Service (WES) API

This repository is the home for the schema for the GA4GH Workflow Execution Service API. The goal of the API is to provide a standardized way to submit and manage workflows described in a workflow language (e.g., WDL, CWL, Nextflow, Galaxy, Snakemake) against an execution backend.

See the human-readable Reference Documentation. You can also explore the specification in the Swagger Editor (manually load the JSON if working from a non-develop branch version). Preview documentation built by gh-openapi-docs for the development branch is available here.

All documentation and pages hosted at 'ga4gh.github.io/workflow-execution-service' reflect the latest API release from the master branch. To monitor the latest development work, add 'preview/<branch>' to the URLs above (e.g., 'ga4gh.github.io/workflow-execution-service/preview/<branch>/docs'). To view the latest stable development API specification, refer to the develop branch.

The Global Alliance for Genomics and Health is an international coalition, formed to enable the sharing of genomic and clinical data.

Cloud Work Stream

The Cloud Work Stream helps the genomics and health communities take full advantage of modern cloud environments. Our initial focus is on “bringing the algorithms to the data”, by creating standards for defining, sharing, and executing portable workflows.

We work with platform development partners and industry leaders to develop standards that will facilitate interoperability.

What is WES?

The Workflow Execution Service API describes a standard programmatic way to run and manage workflows. Having this standard API supported by multiple execution engines will let people run the same workflow using various execution platforms running on various clouds/environments. Key features include:

  • ability to request a workflow run using CWL or WDL
  • ability to parameterize that workflow using a JSON schema
  • ability to get information about running workflows
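
For example, a run can be submitted and monitored with a handful of HTTP calls. This is a minimal sketch against a hypothetical server at https://wes.example.org, assuming the 1.x runs endpoints; paths, field names, and state names may differ on servers implementing older drafts of the spec.

# Minimal sketch of submitting and monitoring a workflow run against a hypothetical WES
# endpoint (https://wes.example.org). Assumes the 1.x "runs" API; older drafts may differ.
import json
import time

import requests

BASE = "https://wes.example.org/ga4gh/wes/v1"

# WES run requests are multipart/form-data; passing fields via `files` as (None, value)
# tuples makes requests send plain form fields in a multipart body.
fields = {
    "workflow_url": "https://example.org/workflows/hello.cwl",
    "workflow_type": "CWL",
    "workflow_type_version": "v1.0",
    "workflow_params": json.dumps({"message": "Hello WES"}),
}
resp = requests.post(f"{BASE}/runs", files={k: (None, v) for k, v in fields.items()})
resp.raise_for_status()
run_id = resp.json()["run_id"]

# Poll the run until it reaches a terminal state.
while True:
    state = requests.get(f"{BASE}/runs/{run_id}/status").json()["state"]
    if state in {"COMPLETE", "EXECUTOR_ERROR", "SYSTEM_ERROR", "CANCELED"}:
        break
    time.sleep(10)

print(run_id, state)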

Use Cases

Use cases include:

  • "Bring your code to the data": a researcher who has built their own custom analysis can submit it to run on a dataset owned by an external organization, instead of having to make a copy of the data
  • Best-practices pipelines: a researcher who maintains their own controlled data environment can find useful workflows in a shared directory (e.g., Dockstore.org), and run them over their data

Starter Kit

If you are a future implementer, or would like to start using a WES API locally, you can try the GA4GH WES Starter Kit. This project provides a fully functioning WES API written in Java and lets you run workflows written in the Nextflow workflow language.
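
As a quick smoke test, you can fetch the service-info document from a running instance. The sketch below assumes the service is listening at http://localhost:8080; substitute whatever host and port your Starter Kit deployment actually uses.

# Quick check that a local WES instance (e.g. the Starter Kit) is up: fetch service-info.
# The localhost URL and port are assumptions; adjust them to your deployment.
import requests

BASE = "http://localhost:8080/ga4gh/wes/v1"

info = requests.get(f"{BASE}/service-info").json()
print(info.get("supported_wes_versions"))
print(info.get("workflow_type_versions"))   # e.g. supported CWL/WDL/Nextflow versions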

Possible Future Enhancements

  • common JSON parameterization format that works with CWL and WDL
  • validation service for testing WES implementations' conformance to the spec
  • improved tools for troubleshooting execution failures, especially when there are 100s-1000s of tasks
  • a callback mechanism for monitoring status changes in running workflows (e.g., a webhook)
  • integration with GA4GH data access APIs (e.g., htsget, DOS)

How to Contribute Changes

See CONTRIBUTING.md.

If a security issue is identified with the specification, please send an email to [email protected] detailing your concerns.

License

See the LICENSE.

More Information

workflow-execution-service-schemas's People

Contributors

achave11-ucsc, aniewielska, briandoconnor, david4096, delagoya, denis-yuen, dglazer, geoffjentry, jaeddy, junjun-zhang, kellrott, mr-c, patmagee, rishidev, ruchim, susheel, tetron, uniqueg, vsmalladi, wleepang

workflow-execution-service-schemas's Issues

More granular `supported_filesystem_protocols` in service info

Currently this is just an array:

      supported_filesystem_protocols:
        type: array
        items:
          type: string
        description: |-
          The filesystem protocols supported by this service, currently these may include common
          protocols such as 'http', 'https', 'sftp', 's3', 'gs', 'file', 'synapse', or others as
          supported by this service.

However, the WES server might (is likely to?) support different protocols than the environment where individual tasks are executed. For example, the Broad Cromwell WES we used for the Toronto Testbed Demo was set up on an EC2 instance and could receive workflow contents (descriptors, parameters) via http; however, tasks were run via PAPI and thus required inputs/outputs as GS bucket URLs. For v1.0 of the spec, we're also proposing to store workflow and task logs at accessible (to the client) URLs, and so some protocol hint might be useful for those as well.

I'm thinking we could change this property to an object with keys indicating supported protocols for workflows, tasks, and logs, respectively:

{"workflows": ["http", "file"], "tasks": ["gs"], "logs": ["gs"]}

This might still be an oversimplification (e.g., if a WES endpoint supports multiple task execution backends, then each might support a different fs protocol...) — but it would at least provide an initial barrier to prevent a client from submitting incompatible jobs.
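
To make the intent concrete, here is a sketch of how a client might use that proposed object form to screen a submission before sending it; the workflows/tasks/logs keys come from the proposal above and are not part of the current spec.

# Hypothetical client-side compatibility check against the proposed object form of
# supported_filesystem_protocols (key names are from the proposal above, not the spec).
from urllib.parse import urlparse


def compatible(service_info, workflow_url, task_input_urls):
    protocols = service_info.get("supported_filesystem_protocols", {})
    workflow_ok = urlparse(workflow_url).scheme in protocols.get("workflows", [])
    tasks_ok = all(urlparse(u).scheme in protocols.get("tasks", []) for u in task_input_urls)
    return workflow_ok and tasks_ok


info = {"supported_filesystem_protocols": {"workflows": ["http", "file"], "tasks": ["gs"], "logs": ["gs"]}}
print(compatible(info, "http://example.com/wf.cwl", ["gs://bucket/reads.fq"]))  # True
print(compatible(info, "http://example.com/wf.cwl", ["s3://bucket/reads.fq"]))  # False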

URLs for stderr/stdout

A mixture of systems is being used for reporting errors and logs; this was identified as needing improvement, based on feedback from the Proof of Concept Testbed application.

Some discussion on this is underway in PR#30.
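
For illustration, a run log in which stdout and stderr are exposed as client-fetchable URLs might look like the sketch below; the field names follow the Log object as it later took shape in the spec and should be treated as an example rather than a normative description.

# Illustrative shape of a run log where stdout/stderr are URLs the client can dereference
# directly (the direction discussed in PR#30). Field names are an example, not normative.
import requests

run_log = {
    "run_id": "run-1234",
    "run_log": {
        "name": "hello-workflow",
        "start_time": "2018-07-01T12:00:00Z",
        "end_time": "2018-07-01T12:05:00Z",
        "stdout": "https://wes.example.org/logs/run-1234/stdout",
        "stderr": "https://wes.example.org/logs/run-1234/stderr",
        "exit_code": 0,
    },
}

# A client simply fetches the URL to read the log text.
stderr_text = requests.get(run_log["run_log"]["stderr"]).text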

Workflow execution documentation

Is there a document providing a detailed description of each endpoint's parameters and data structures?

In particular I'm looking for the description of the workflow execution request data structure. The protobuf reports the following object:

{
  "workflow_descriptor": "string",
  "workflow_params": "string",
  "workflow_type": "string",
  "workflow_type_version": "string",
  "key_values": {
    "additionalProp1": "string",
    "additionalProp2": "string",
    "additionalProp3": "string"
  }
}

However, without further details the semantics of the parameters (apart from workflow_descriptor) are not clear, nor is it clear how they are supposed to be used.
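
In the absence of such a document, the sketch below shows one plausible interpretation of how these fields are meant to be populated; it is an illustration, not normative documentation.

# Illustrative population of the request object above (an interpretation, not normative):
# workflow_params carries the workflow's inputs as a JSON string, workflow_type and
# workflow_type_version identify the language, and key_values is a free-form map for
# engine- or site-specific settings.
import json

request = {
    # Content of (or a URL pointing to) the workflow definition itself.
    "workflow_descriptor": "cwlVersion: v1.0\nclass: CommandLineTool\n...",
    # Workflow inputs, serialized as JSON.
    "workflow_params": json.dumps({"input_file": "gs://bucket/sample.bam"}),
    # Which language the descriptor is written in, and its version.
    "workflow_type": "CWL",
    "workflow_type_version": "v1.0",
    # Arbitrary key/value settings passed through to the engine.
    "key_values": {"priority": "high"},
}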

Document versioning

The current versioning scheme does not follow semantic versioning. We should move towards a semantic version to make it clear when there are breaking changes.

Add a limit to paging

A decision was made at a GA4GH Cloud meeting to add a limit parameter to the API:

  • limit means "max # of results returned by this call"
  • paging is already in the spec -- we like that

Being added in PR#30
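
A sketch of how a client could page through runs once a result limit is in place; the page_size, page_token, and next_page_token names reflect the paging scheme as it was being merged, so adjust them if your server implements a different draft.

# Sketch of paging through runs with a result limit. Parameter names (page_size,
# page_token, next_page_token) are assumptions based on the eventual paging scheme.
import requests

BASE = "https://wes.example.org/ga4gh/wes/v1"


def list_all_runs(page_size=50):
    token = None
    while True:
        params = {"page_size": page_size}
        if token:
            params["page_token"] = token
        page = requests.get(f"{BASE}/runs", params=params).json()
        yield from page.get("runs", [])
        token = page.get("next_page_token")
        if not token:
            break


for run in list_all_runs():
    print(run["run_id"], run["state"])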

Orchestrator: wire up HCA WES endpoint

For a demonstration of the GA4GH Cloud Work Stream Testbed Interoperability Platform in Basel, a WES endpoint is to be set up by HCA and made accessible from the to-be-selected orchestrator.

Demonstrate implementations before merging schema changes

When trying to work on schemas in this manner, it is helpful to have a working implementation before merging a schema change. For this we have https://github.com/common-workflow-language/workflow-service, which is currently lagging behind the state of the schemas, making testing and reviewing further changes a challenge.

Although we always hope to avoid breaking changes, there have been a number in the last few merges. To help curtail this, I suggest we require a demonstrated implementation to go along with schema changes. That does not mean the workflow-service decides which WES endpoints are developed; just that schema changes are proven in some way before being accepted here. Driver projects are still empowered to suggest changes and push issues forward.

This might allow us to come up with a release pattern so that semantic versions of WES can be easily associated with workflow-service versions.

This would be a change to the contributor guidelines and would require documenting a process for reviewing features in development branches of the workflow-service when reviewing schema changes here.

@tetron @geoffjentry @dglazer

Progress monitoring

There are use cases for retrieving intermediate information from a workflow, e.g.:

  • What step (of a multistep workflow) is in progress and what is its percent complete? (A user may wish to cancel a workflow if it's too slow.)
  • How far or how well has my machine learning model converged? (Again, a poorly progressing model might be canceled.)

Under the existing API, such in-progress information would come from retrieving and reading or parsing the workflow's log files. If the information could be returned in a more structured form (e.g., as a set of key-value pairs) then, when used with #21, a client could create a dashboard of running workflows and/or answer questions like "which of my jobs is closest to complete?"
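
Purely as an illustration, a structured progress payload and a trivial dashboard-style query over it might look like the sketch below; every field name here is invented for the sake of discussion.

# Hypothetical structured progress payload (all field names invented for illustration)
# and a tiny "which run is closest to complete" query over a set of such payloads.
runs = [
    {"run_id": "run-1", "state": "RUNNING",
     "progress": {"current_step": "align", "steps_done": 3, "steps_total": 5, "percent": 60}},
    {"run_id": "run-2", "state": "RUNNING",
     "progress": {"current_step": "train", "steps_done": 9, "steps_total": 10, "percent": 90,
                  "loss": 0.021}},  # e.g. model convergence reported by the workflow
]

closest = max(runs, key=lambda r: r["progress"]["percent"])
print(f"{closest['run_id']} is furthest along ({closest['progress']['percent']}%)")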

Orchestrator: wire up TopMed WES endpoint

For a demonstration of the GA4GH Cloud Work Stream Testbed Interoperability Platform in Basel, a WES endpoint is to be set up by TopMed and made accessible from the to-be-selected orchestrator.

Workflow for Basel Testbed demo from HCA

For a demonstration of the GA4GH Cloud Work Stream Testbed Interoperability Platform in Basel, a workflow from the Human Cell Atlas Driver Project needs to be loaded into Dockstore.

Error handling

The response messages for bad requests, resources that are not found, etc. are not described in the schema.

Can this be done directly in protobuf? Should this be done via modifying the OpenAPI description directly?
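
For reference, a client handling a structured error body might look like the sketch below; the msg and status_code fields match the error shape that was later adopted, but older drafts may return something different.

# Sketch of handling a structured error response. The body shape shown (msg plus
# status_code) is illustrative; it matches the ErrorResponse that was later added.
import requests

resp = requests.get("https://wes.example.org/ga4gh/wes/v1/runs/does-not-exist")
if resp.status_code == 404:
    body = resp.json()   # e.g. {"msg": "The requested run was not found", "status_code": 404}
    print(body.get("msg"))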

Clarify how the WES API is used by multiple, independent users sharing a back end

Approaches might include:

  1. No fine-grained authorization: Anyone who can access a WES compliant workflow engine has access to all the workflows in that system;
  2. All WES APIs require authentication, and the authorized user has access to just the workflows that they created. When a user asks to 'list all workflows', the list includes just the ones they created (see the sketch after this list);
  3. Users can share access to workflows they create with other users. This may require additional APIs in WES or, alternatively, access control could be deemed out-of-scope for WES.
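
A minimal sketch of approach 2, in which every request is authenticated and run listings are scoped to the caller; the Bearer-token mechanism shown is an assumption, since WES does not mandate a particular auth scheme.

# Sketch of approach 2: all requests are authenticated, and GET /runs returns only the
# caller's own runs. Using an OAuth2 Bearer token is an assumption, not a WES requirement.
import requests

BASE = "https://wes.example.org/ga4gh/wes/v1"
headers = {"Authorization": "Bearer <user-access-token>"}

# Each user sees only the runs they created.
my_runs = requests.get(f"{BASE}/runs", headers=headers).json().get("runs", [])
print(f"I can see {len(my_runs)} runs")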

Schema proposal discussion

Regarding the design sketch here: https://github.com/ga4gh/workflow-execution-schemas/blob/proposal/Flow.md

Some possible flavour changes:

  1. Unify GET and POST models. When POSTing a new job (i.e. WorkflowInstance or TaskInstance), we can use the same model (with a subset of fields) that is returned when querying for status. The example submission POST data would look like this:
{
  "tool": "http://example.com/tools/bwa/version/1.0/descriptor",
  "input": {
    "stringparameter": "value",
    "fileparameter": {
      "class": "File",
      "location": "http://storage.example.com/bucket/file1.fq"
    }
  }
}
  2. POST to the collection endpoint, not to /submit. So, we'd have POST http://example.com/workflows and GET http://example.com/workflows/:id (see the sketch after this list).

  3. A more suitable endpoint name may be /jobs, but the name is usually irrelevant (an implementation detail?).

  4. When representing files, replace the class key with $type to avoid possible conflicts with record values.

  5. Split the queued state into pending and ready to differentiate between jobs that are waiting for dependency jobs and jobs that may be waiting for, e.g., compute resources.

  6. We may want to show inputs and outputs of subjobs. For this it may make sense to introduce rootId and parentId properties on each job (including top-level jobs), then let people query with e.g. example.com/jobs?parentId=:id to traverse the tree rather than including it in the top-level job.
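
Pulling items 2 and 6 together, a client interaction with the proposed collection-style API might look like the sketch below; all endpoint and field names come from this proposal, not from the current schema.

# Minimal sketch of the proposed collection-style API: POST to the /workflows collection,
# then traverse subjobs via a parentId query parameter. Names come from the proposal.
import requests

BASE = "http://example.com"

# Create a job by POSTing to the collection endpoint (item 2).
job = requests.post(f"{BASE}/workflows", json={
    "tool": "http://example.com/tools/bwa/version/1.0/descriptor",
    "input": {
        "stringparameter": "value",
        "fileparameter": {"$type": "File",   # "$type" instead of "class", per item 4
                          "location": "http://storage.example.com/bucket/file1.fq"},
    },
}).json()

job_id = job["id"]   # the created job's id (response shape assumed)

# Fetch it back from the same collection, then list its direct subjobs (item 6).
status = requests.get(f"{BASE}/workflows/{job_id}").json()
subjobs = requests.get(f"{BASE}/jobs", params={"parentId": job_id}).json()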

Fix schema version

At some point, the schema version listed in the swagger YAML got off track: the latest develop branch shows 0.2.2, even though there's a 0.3.0 release available for the repo.

I've noticed that @david4096 and @natanlao have been doing a lot of work to improve packaging, versioning, building, etc. for the DOS schemas — maybe they could advise here.

contact_info incorrectly indented in workflow_execution_service.swagger.yaml

Branch: develop (67729e4).

The workflow_execution_service.swagger.yaml cannot be processed by swagger-cli because of the following error. The bug does not appear in d1cb6aa.

./node_modules/.bin/swagger-cli validate workflow-execution-service-schemas/openapi/workflow_execution_service.swagger.yaml
Error parsing /work/ilmn/git/_other/workflow-execution-service-schemas/openapi/workflow_execution_service.swagger.yaml
bad indentation of a mapping entry at line 352, column 6:
         contact_info:
         ^

Data Retention Policy

A Guidelines section should be provided to advise implementers on a data retention policy. This issue provides a space for that discussion to take place. A PR can then be raised with the changes in place.

Data access credentials

If the execution environment needs to access data using API keys, can we offer an interoperable way to achieve this? Are these passed as parameters to the workflow?

For example, if I want to run a workflow that accesses a protected S3 bucket, how do I authorize the WES instance to access those data in an interoperable way when running its workflow?

Bump to OpenAPI 3.0

With major substantive and structural changes happening to the spec en route to v1.0, I wonder whether now would also be the best time to bite the bullet and convert to the latest OpenAPI spec. I'm not sure how much work would be involved, or whether the change would have any major impact on service/client implementers.

Orchestrator: wire up AGHA WES endpoint

For a demonstration of the GA4GH Cloud Work Stream Testbed Interoperability Platform in Basel, a WES endpoint is to be set up by AGHA and made accessible from the to-be-selected orchestrator.

WES Reg and Ethics WS Form submitted

A copy of the Regulatory and Ethics WS Product Approval form needs to be set up for the Cloud WS to work on and then submitted as part of the Product Approval Submission.

Data Security Product Approval Submission Form

A copy of the Data Security WS Product Approval Submission form needs to be set up for the Cloud WS to work on and then submitted as part of the Product Approval Submission. This will wait until the form that has already been submitted has been reviewed.

File upload

When setting up some WDL workflows, one must provide a set of dependencies that are not derived from the WDL itself. This feature should allow one to send initialization files for a workflow.
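
One possible shape for this, sketched below, is to send the dependencies as multipart workflow_attachment parts alongside the run request (the mechanism that later landed in the 1.x spec); treat it as illustrative if you are targeting an older draft.

# Sketch of sending workflow dependencies alongside the descriptor as multipart
# "workflow_attachment" parts. Field names follow the later 1.x spec; illustrative only.
import json

import requests

BASE = "https://wes.example.org/ga4gh/wes/v1"

with open("hello.wdl", "rb") as wdl, open("inputs_table.tsv", "rb") as deps:
    resp = requests.post(
        f"{BASE}/runs",
        data={
            "workflow_url": "hello.wdl",   # relative path resolved against the attachments
            "workflow_type": "WDL",
            "workflow_type_version": "1.0",
            "workflow_params": json.dumps({"hello.table": "inputs_table.tsv"}),
        },
        files=[
            ("workflow_attachment", ("hello.wdl", wdl)),
            ("workflow_attachment", ("inputs_table.tsv", deps)),
        ],
    )
resp.raise_for_status()
print(resp.json()["run_id"])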

CWL didn't come from GA4GH, please make this clear in the README

"From within this group, two approaches have emerged, resulting in the production of two distinct but complementary specifications: the Common Workflow Language, or CWL, and the Workflow Description Language, or WDL. "

Additionally:

The characterization that "[t]he CWL approach emphasizes execution features and machine-readability, and serves a core target audience of software and platform developers. The WDL approach, on the other hand, emphasizes scripting and human-readability, and serves a core target audience of research scientists" doesn't seem to be true, nor does it serve a useful purpose, and should be removed.

Thanks!

Proposal: defining new common schema to describe concrete job execution plan

Not sure this is the right place to start this type of discussion, but here it is.

I feel I can't be the only one wondering whether there is anything we can do about the fact that there are so many different workflow definition options: CWL, WDL, Toil, Galaxy, Airflow and Nextflow, just to name a few used in the bioinformatics world. I don't expect any of them to go away; it is important for workflow authors to have diverse choices. However, there is increasing demand for a particular workflow execution engine to be able to run workflows defined in different workflow languages. Instead of each of the six execution engines mentioned earlier writing five different parsers (30 parsers in total), is there a better way to do it?

My thought is that it would greatly reduce the effort if we could come up with a new common schema describing a concrete job execution plan that can be compiled from a workflow defined in any of the existing workflow languages. With this approach, each workflow language would only need one converter to translate its own execution plan into the new common schema. Then, if an execution engine would like to support workflows written in other languages, it just needs to implement the capability to execute the 'common schema'.
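
Purely to make the idea concrete, a compiled execution plan might flatten a workflow, whatever language it was written in, into a DAG of containerized steps; all of the field names below are made up purely for illustration.

# Purely illustrative shape for a "common execution plan": a workflow compiles down to a
# DAG of concrete containerized steps. Every field name here is invented for discussion.
execution_plan = {
    "plan_version": "0.1-draft",
    "jobs": [
        {
            "id": "align",
            "image": "quay.io/biocontainers/bwa:0.7.17",
            "command": ["bwa", "mem", "ref.fa", "reads.fq"],
            "inputs": {"ref.fa": "gs://bucket/ref.fa", "reads.fq": "gs://bucket/reads.fq"},
            "outputs": {"aligned.sam": "stdout"},
            "depends_on": [],
        },
        {
            "id": "sort",
            "image": "quay.io/biocontainers/samtools:1.9",
            "command": ["samtools", "sort", "-o", "aligned.bam", "aligned.sam"],
            "inputs": {"aligned.sam": "align/aligned.sam"},
            "outputs": {"aligned.bam": "aligned.bam"},
            "depends_on": ["align"],
        },
    ],
}

# An engine that understands this plan could run CWL, WDL, Nextflow, etc. workflows once
# a per-language compiler to this format exists.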

Is this feasible? We don't necessarily need to cover all workflow languages; being able to support the two most popular, CWL and WDL, should be good enough and is probably a good starting point.

Chatted with @geoffjentry this afternoon; he seems to agree with this approach.

Thoughts?
