
workflow-execution-service-schemas's Introduction

Workflow Execution Service (WES) API

This repository is the home for the schema for the GA4GH Workflow Execution Service API. The goal of the API is to provide a standardized way to submit and manage workflows described in a workflow language (e.g., WDL, CWL, Nextflow, Galaxy, Snakemake) against an execution backend.

See the human-readable Reference Documentation. You can also explore the specification in the Swagger Editor (manually load the JSON if working from a non-develop branch version). Preview documentation built by gh-openapi-docs for the development branch is available here.

All documentation and pages hosted at 'ga4gh.github.io/workflow-execution-service' reflect the latest API release from the master branch. To monitor the latest development work, add 'preview/<branch>' to the URLs above (e.g., 'ga4gh.github.io/workflow-execution-service/preview/<branch>/docs'). To view the latest stable development API specification, refer to the develop branch.

The Global Alliance for Genomics and Health is an international coalition, formed to enable the sharing of genomic and clinical data.

Cloud Work Stream

The Cloud Work Stream helps the genomics and health communities take full advantage of modern cloud environments. Our initial focus is on “bringing the algorithms to the data”, by creating standards for defining, sharing, and executing portable workflows.

We work with platform development partners and industry leaders to develop standards that will facilitate interoperability.

What is WES?

The Workflow Execution Service API describes a standard programmatic way to run and manage workflows. Having this standard API supported by multiple execution engines will let people run the same workflow using various execution platforms running on various clouds/environments. Key features include:

  • ability to request a workflow run using CWL or WDL
  • ability to parameterize that workflow using a JSON schema
  • ability to get information about running workflows
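
For example, a run can be submitted and monitored with a handful of HTTP calls. This is a minimal sketch against a hypothetical server at https://wes.example.org, assuming the 1.x runs endpoints; paths, field names, and state names may differ on servers implementing older drafts of the spec.

# Minimal sketch of submitting and monitoring a workflow run against a hypothetical WES
# endpoint (https://wes.example.org). Assumes the 1.x "runs" API; older drafts may differ.
import json
import time

import requests

BASE = "https://wes.example.org/ga4gh/wes/v1"

# WES run requests are multipart/form-data; passing fields via `files` as (None, value)
# tuples makes requests send plain form fields in a multipart body.
fields = {
    "workflow_url": "https://example.org/workflows/hello.cwl",
    "workflow_type": "CWL",
    "workflow_type_version": "v1.0",
    "workflow_params": json.dumps({"message": "Hello WES"}),
}
resp = requests.post(f"{BASE}/runs", files={k: (None, v) for k, v in fields.items()})
resp.raise_for_status()
run_id = resp.json()["run_id"]

# Poll the run until it reaches a terminal state.
while True:
    state = requests.get(f"{BASE}/runs/{run_id}/status").json()["state"]
    if state in {"COMPLETE", "EXECUTOR_ERROR", "SYSTEM_ERROR", "CANCELED"}:
        break
    time.sleep(10)

print(run_id, state)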

Use Cases

Use cases include:

  • "Bring your code to the data": a researcher who has built their own custom analysis can submit it to run on a dataset owned by an external organization, instead of having to make a copy of the data
  • Best-practices pipelines: a researcher who maintains their own controlled data environment can find useful workflows in a shared directory (e.g., Dockstore.org), and run them over their data

Starter Kit

If you are a future implementer, or would like to start using a WES API locally, you can try the GA4GH WES Starter Kit. This project provides a fully functioning WES API written in Java and lets you run workflows written in the Nextflow workflow language.
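
As a quick smoke test, you can fetch the service-info document from a running instance. The sketch below assumes the service is listening at http://localhost:8080; substitute whatever host and port your Starter Kit deployment actually uses.

# Quick check that a local WES instance (e.g. the Starter Kit) is up: fetch service-info.
# The localhost URL and port are assumptions; adjust them to your deployment.
import requests

BASE = "http://localhost:8080/ga4gh/wes/v1"

info = requests.get(f"{BASE}/service-info").json()
print(info.get("supported_wes_versions"))
print(info.get("workflow_type_versions"))   # e.g. supported CWL/WDL/Nextflow versions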

Possible Future Enhancements

  • common JSON parameterization format that works with CWL and WDL
  • validation service for testing WES implementations' conformance to the spec
  • improved tools for troubleshooting execution failures, especially when there are 100s-1000s of tasks
  • a callback mechanism for monitoring status changes in running workflows (e.g., a webhook)
  • integration with GA4GH data access APIs (e.g., htsget, DOS)

How to Contribute Changes

See CONTRIBUTING.md.

If a security issue is identified with the specification, please send an email to [email protected] detailing your concerns.

License

See the LICENSE.

More Information

workflow-execution-service-schemas's People

Contributors

achave11-ucsc, aniewielska, briandoconnor, david4096, delagoya, denis-yuen, dglazer, geoffjentry, jaeddy, junjun-zhang, kellrott, mr-c, patmagee, rishidev, ruchim, susheel, tetron, uniqueg, vsmalladi, wleepang

workflow-execution-service-schemas's Issues

More granular `supported_filesystem_protocols` in service info

Currently this is just an array:

      supported_filesystem_protocols:
        type: array
        items:
          type: string
        description: |-
          The filesystem protocols supported by this service, currently these may include common
          protocols such as 'http', 'https', 'sftp', 's3', 'gs', 'file', 'synapse', or others as
          supported by this service.

However, the WES server might (is likely to?) support different protocols than the environment where individual tasks are executed. For example, the Broad Cromwell WES we used for the Toronto Testbed Demo was set up on an EC2 instance and could receive workflow contents (descriptors, parameters) via http; however, tasks were run via PAPI and thus required inputs/outputs as GS bucket URLs. For v1.0 of the spec, we're also proposing to store workflow and task logs at accessible (to the client) URLs, and so some protocol hint might be useful for those as well.

I'm thinking we could change this property to an object with keys indicating supported protocols for workflows, tasks, and logs, respectively:

{"workflows": ["http", "file"], "tasks": ["gs"], "logs": ["gs"]}

This might still be an oversimplification (e.g., if a WES endpoint supports multiple task execution backends, then each might support a different fs protocol...) — but it would at least provide an initial barrier to prevent a client from submitting incompatible jobs.
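
To make the intent concrete, here is a sketch of how a client might use that proposed object form to screen a submission before sending it; the workflows/tasks/logs keys come from the proposal above and are not part of the current spec.

# Hypothetical client-side compatibility check against the proposed object form of
# supported_filesystem_protocols (key names are from the proposal above, not the spec).
from urllib.parse import urlparse


def compatible(service_info, workflow_url, task_input_urls):
    protocols = service_info.get("supported_filesystem_protocols", {})
    workflow_ok = urlparse(workflow_url).scheme in protocols.get("workflows", [])
    tasks_ok = all(urlparse(u).scheme in protocols.get("tasks", []) for u in task_input_urls)
    return workflow_ok and tasks_ok


info = {"supported_filesystem_protocols": {"workflows": ["http", "file"], "tasks": ["gs"], "logs": ["gs"]}}
print(compatible(info, "http://example.com/wf.cwl", ["gs://bucket/reads.fq"]))  # True
print(compatible(info, "http://example.com/wf.cwl", ["s3://bucket/reads.fq"]))  # False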

URLs for stderr/stdout

A mixture of systems is being used for reporting errors and logs; this was identified as needing improvement, based on feedback from the Proof of Concept Testbed application.

Some discussion on this is underway in PR#30.
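
For illustration, a run log in which stdout and stderr are exposed as client-fetchable URLs might look like the sketch below; the field names follow the Log object as it later took shape in the spec and should be treated as an example rather than a normative description.

# Illustrative shape of a run log where stdout/stderr are URLs the client can dereference
# directly (the direction discussed in PR#30). Field names are an example, not normative.
import requests

run_log = {
    "run_id": "run-1234",
    "run_log": {
        "name": "hello-workflow",
        "start_time": "2018-07-01T12:00:00Z",
        "end_time": "2018-07-01T12:05:00Z",
        "stdout": "https://wes.example.org/logs/run-1234/stdout",
        "stderr": "https://wes.example.org/logs/run-1234/stderr",
        "exit_code": 0,
    },
}

# A client simply fetches the URL to read the log text.
stderr_text = requests.get(run_log["run_log"]["stderr"]).text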

Workflow execution documentation

Is there a document providing a detailed description of each endpoint's parameters and data structures?

In particular I'm looking for the description of the workflow execution request data structure. The protobuf reports the following object:

{
  "workflow_descriptor": "string",
  "workflow_params": "string",
  "workflow_type": "string",
  "workflow_type_version": "string",
  "key_values": {
    "additionalProp1": "string",
    "additionalProp2": "string",
    "additionalProp3": "string"
  }
}

However, without further details the semantics of the parameters (apart from workflow_descriptor) are not clear, nor is it clear how they are supposed to be used.
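
In the absence of such a document, the sketch below shows one plausible interpretation of how these fields are meant to be populated; it is an illustration, not normative documentation.

# Illustrative population of the request object above (an interpretation, not normative):
# workflow_params carries the workflow's inputs as a JSON string, workflow_type and
# workflow_type_version identify the language, and key_values is a free-form map for
# engine- or site-specific settings.
import json

request = {
    # Content of (or a URL pointing to) the workflow definition itself.
    "workflow_descriptor": "cwlVersion: v1.0\nclass: CommandLineTool\n...",
    # Workflow inputs, serialized as JSON.
    "workflow_params": json.dumps({"input_file": "gs://bucket/sample.bam"}),
    # Which language the descriptor is written in, and its version.
    "workflow_type": "CWL",
    "workflow_type_version": "v1.0",
    # Arbitrary key/value settings passed through to the engine.
    "key_values": {"priority": "high"},
}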

Document versioning

The current versioning scheme does not follow semantic versioning. We should move towards a semantic version to make it clear when there are breaking changes.

Add a limit to paging

A decision was made at a GA4GH Cloud meeting to add a limit parameter to the API:

  • limit means "max # of results returned by this call"
  • paging is already in the spec -- we like that

Being added in PR#30
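
A sketch of how a client could page through runs once a result limit is in place; the page_size, page_token, and next_page_token names reflect the paging scheme as it was being merged, so adjust them if your server implements a different draft.

# Sketch of paging through runs with a result limit. Parameter names (page_size,
# page_token, next_page_token) are assumptions based on the eventual paging scheme.
import requests

BASE = "https://wes.example.org/ga4gh/wes/v1"


def list_all_runs(page_size=50):
    token = None
    while True:
        params = {"page_size": page_size}
        if token:
            params["page_token"] = token
        page = requests.get(f"{BASE}/runs", params=params).json()
        yield from page.get("runs", [])
        token = page.get("next_page_token")
        if not token:
            break


for run in list_all_runs():
    print(run["run_id"], run["state"])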

Orchestrator: wire up HCA WES endpoint

For a demonstration of the GA4GH Cloud Work Stream Testbed Interoperability Platform in Basel, a WES endpoint is to be set up by HCA and made accessible from the to-be-selected orchestrator.

Demonstrate implementations before merging schema changes

When trying to work on schemas in this manner, it is helpful to have a working implementation before merging a schema change. For this we have https://github.com/common-workflow-language/workflow-service, which is currently lagging behind the state of the schemas, making testing and reviewing further changes a challenge.

Although we always hope to avoid breaking changes, there have been a number in the last few merges. To help curtail this, I suggest we require a demonstrated implementation to go along with schema changes. That does not mean the workflow-service decides which WES endpoints are developed; just that schema changes are proven in some way before being accepted here. Driver projects are still empowered to suggest changes and push issues forward.

This might allow us to come up with a release pattern so that semantic versions of WES can be easily associated with workflow-service versions.

This would be a change to the contributor guidelines and would require documenting a process for reviewing features in development branches of the workflow-service when reviewing schema changes here.

@tetron @geoffjentry @dglazer

Progress monitoring

There are use cases for retrieving intermediate information from a workflow, e.g.:

  • What step (of a multistep workflow) is in progress and what is its percent complete? (A user may wish to cancel a workflow if it's too slow.)
  • How far or how well has my machine learning model converged? (Again, a poorly progressing model might be canceled.)

Under the existing API, such in-progress information would come from retrieving and reading or parsing the workflow's log files. If the information could be returned in a more structured form (e.g., as a set of key-value pairs) then, when used with #21, a client could create a dashboard of running workflows and/or answer questions like "which of my jobs is closest to complete?"
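
Purely as an illustration, a structured progress payload and a trivial dashboard-style query over it might look like the sketch below; every field name here is invented for the sake of discussion.

# Hypothetical structured progress payload (all field names invented for illustration)
# and a tiny "which run is closest to complete" query over a set of such payloads.
runs = [
    {"run_id": "run-1", "state": "RUNNING",
     "progress": {"current_step": "align", "steps_done": 3, "steps_total": 5, "percent": 60}},
    {"run_id": "run-2", "state": "RUNNING",
     "progress": {"current_step": "train", "steps_done": 9, "steps_total": 10, "percent": 90,
                  "loss": 0.021}},  # e.g. model convergence reported by the workflow
]

closest = max(runs, key=lambda r: r["progress"]["percent"])
print(f"{closest['run_id']} is furthest along ({closest['progress']['percent']}%)")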

Orchestrator: wire up TopMed WES endpoint

For a demonstration of the GA4GH Cloud Work Stream Testbed Interoperability Platform in Basel, a WES endpoint is to be set up by TopMed and made accessible from the to-be-selected orchestrator.

Workflow for Basel Testbed demo from HCA

For a demonstration of the GA4GH Cloud Work Stream Testbed Interoperability Platform in Basel, a workflow from the Human Cell Atlas Driver Project needs to be loaded into Dockstore.

Error handling

The response messages for bad requests, resources that are not found, etc. are not described in the schema.

Can this be done directly in protobuf? Should this be done via modifying the OpenAPI description directly?
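
For reference, a client handling a structured error body might look like the sketch below; the msg and status_code fields match the error shape that was later adopted, but older drafts may return something different.

# Sketch of handling a structured error response. The body shape shown (msg plus
# status_code) is illustrative; it matches the ErrorResponse that was later added.
import requests

resp = requests.get("https://wes.example.org/ga4gh/wes/v1/runs/does-not-exist")
if resp.status_code == 404:
    body = resp.json()   # e.g. {"msg": "The requested run was not found", "status_code": 404}
    print(body.get("msg"))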

Clarify how the WES API is used by multiple, independent users sharing a back end

Approaches might include:

  1. No fine-grained authorization: Anyone who can access a WES compliant workflow engine has access to all the workflows in that system;
  2. All WES APIs require authentication, and the authorized user has access to just the workflows that they created. When a user asks to 'list all workflows', the list includes just the ones they created (see the sketch after this list);
  3. Users can share access to workflows they create with other users. This may require additional APIs in WES or, alternatively, access control could be deemed out-of-scope for WES.
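
A minimal sketch of approach 2, in which every request is authenticated and run listings are scoped to the caller; the Bearer-token mechanism shown is an assumption, since WES does not mandate a particular auth scheme.

# Sketch of approach 2: all requests are authenticated, and GET /runs returns only the
# caller's own runs. Using an OAuth2 Bearer token is an assumption, not a WES requirement.
import requests

BASE = "https://wes.example.org/ga4gh/wes/v1"
headers = {"Authorization": "Bearer <user-access-token>"}

# Each user sees only the runs they created.
my_runs = requests.get(f"{BASE}/runs", headers=headers).json().get("runs", [])
print(f"I can see {len(my_runs)} runs")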

Schema proposal discussion

Regarding the design sketch here: https://github.com/ga4gh/workflow-execution-schemas/blob/proposal/Flow.md

Some possible flavour changes:

  1. Unify GET and POST models. When POSTing a new job (i.e. WorkflowInstance or TaskInstance), we can use the same model (with a subset of fields) that is returned when querying for status. The example submission POST data would look like this:
{
  "tool": "http://example.com/tools/bwa/version/1.0/descriptor",
  "input": {
    "stringparameter": "value",
    "fileparameter": {
      "class": "File",
      "location": "http://storage.example.com/bucket/file1.fq"
    }
  }
}
  2. POST to the collection endpoint, not to /submit. So, we'd have POST http://example.com/workflows and GET http://example.com/workflows/:id (see the sketch after this list).

  3. A more suitable endpoint name may be /jobs, but the name is usually irrelevant (an implementation detail?).

  4. When representing files, replace the class key with $type to avoid possible conflicts with record values.

  5. Split the queued state into pending and ready to differentiate between jobs that are waiting for dependency jobs and jobs that may be waiting for, e.g., compute resources.

  6. We may want to show inputs and outputs of subjobs. For this it may make sense to introduce rootId and parentId properties on each job (including top-level jobs), then let people query with e.g. example.com/jobs?parentId=:id to traverse the tree rather than including it in the top-level job.
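
Pulling items 2 and 6 together, a client interaction with the proposed collection-style API might look like the sketch below; all endpoint and field names come from this proposal, not from the current schema.

# Minimal sketch of the proposed collection-style API: POST to the /workflows collection,
# then traverse subjobs via a parentId query parameter. Names come from the proposal.
import requests

BASE = "http://example.com"

# Create a job by POSTing to the collection endpoint (item 2).
job = requests.post(f"{BASE}/workflows", json={
    "tool": "http://example.com/tools/bwa/version/1.0/descriptor",
    "input": {
        "stringparameter": "value",
        "fileparameter": {"$type": "File",   # "$type" instead of "class", per item 4
                          "location": "http://storage.example.com/bucket/file1.fq"},
    },
}).json()

job_id = job["id"]   # the created job's id (response shape assumed)

# Fetch it back from the same collection, then list its direct subjobs (item 6).
status = requests.get(f"{BASE}/workflows/{job_id}").json()
subjobs = requests.get(f"{BASE}/jobs", params={"parentId": job_id}).json()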

Fix schema version

At some point, the schema version listed in the swagger YAML got off track: the latest develop branch shows 0.2.2, even though there's a 0.3.0 release available for the repo.

I've noticed that @david4096 and @natanlao have been doing a lot of work to improve packaging, versioning, building, etc. for the DOS schemas — maybe they could advise here.

contact_info incorrectly indented in workflow_execution_service.swagger.yaml

Branch: develop (67729e4).

The workflow_execution_service.swagger.yaml cannot be processed by swagger-cli because of the following error. The bug does not appear in d1cb6aa.

./node_modules/.bin/swagger-cli validate workflow-execution-service-schemas/openapi/workflow_execution_service.swagger.yaml
Error parsing /work/ilmn/git/_other/workflow-execution-service-schemas/openapi/workflow_execution_service.swagger.yaml
bad indentation of a mapping entry at line 352, column 6:
         contact_info:
         ^

Data Retention Policy

A Guidelines section should be provided to advise implementers on a data retention policy. This issue provides a space for that discussion to take place. A PR can then be raised with the changes in place.

Data access credentials

If the execution environment needs to access data using API keys, can we offer an interoperable way to achieve this? Are these passed as parameters to the workflow?

For example, if I want to run a workflow that accesses a protected S3 bucket, how do I authorize the WES instance to access those data in an interoperable way when running its workflow?

Bump to OpenAPI 3.0

With major substantive and structural changes happening to the spec en route to v1.0, I wonder whether now would also be the best time to bite the bullet and convert to the latest OpenAPI spec. I'm not sure how much work would be involved, or whether the change would have any major impact on service/client implementers.

Orchestrator: wire up AGHA WES endpoint

For a demonstration of the GA4GH Cloud Work Stream Testbed Interoperability Platform in Basel, a WES endpoint is to be set up by AGHA and made accessible from the to-be-selected orchestrator.

WES Reg and Ethics WS Form submitted

A copy of the Regulatory and Ethics WS Product Approval form needs to be set up for the Cloud WS to work on and then submitted as part of the Product Approval Submission.

Data Security Product Approval Submission Form

A copy of the Data Security WS Product Approval Submission form needs to be set up for the Cloud WS to work on and then submitted as part of the Product Approval Submission. This will wait until the form that has already been submitted has been reviewed.

File upload

When setting up some WDL workflows, one must provide a set of dependencies that are not derived from the WDL itself. This feature should allow one to send initialization files for a workflow.
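
One possible shape for this, sketched below, is to send the dependencies as multipart workflow_attachment parts alongside the run request (the mechanism that later landed in the 1.x spec); treat it as illustrative if you are targeting an older draft.

# Sketch of sending workflow dependencies alongside the descriptor as multipart
# "workflow_attachment" parts. Field names follow the later 1.x spec; illustrative only.
import json

import requests

BASE = "https://wes.example.org/ga4gh/wes/v1"

with open("hello.wdl", "rb") as wdl, open("inputs_table.tsv", "rb") as deps:
    resp = requests.post(
        f"{BASE}/runs",
        data={
            "workflow_url": "hello.wdl",   # relative path resolved against the attachments
            "workflow_type": "WDL",
            "workflow_type_version": "1.0",
            "workflow_params": json.dumps({"hello.table": "inputs_table.tsv"}),
        },
        files=[
            ("workflow_attachment", ("hello.wdl", wdl)),
            ("workflow_attachment", ("inputs_table.tsv", deps)),
        ],
    )
resp.raise_for_status()
print(resp.json()["run_id"])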

CWL didn't come from GA4GH, please make this clear in the README

"From within this group, two approaches have emerged, resulting in the production of two distinct but complementary specifications: the Common Workflow Language, or CWL, and the Workflow Description Language, or WDL. "

Additionally:

The characterization that "[t]he CWL approach emphasizes execution features and machine-readability, and serves a core target audience of software and platform developers. The WDL approach, on the other hand, emphasizes scripting and human-readability, and serves a core target audience of research scientists" doesn't seem to be true, nor does it serve a useful purpose, and should be removed.

Thanks!

Proposal: defining new common schema to describe concrete job execution plan

Not sure this is the right place to start this type of discussion, but here it is.

I feel I can't be the only one wondering whether there is anything we can do about the fact that there are so many different workflow definition options: CWL, WDL, Toil, Galaxy, Airflow and Nextflow, just to name a few used in the bioinformatics world. I don't expect any of them to go away; it is important for workflow authors to have diverse choices. However, there is increasing demand for a particular workflow execution engine to be able to run workflows defined in different workflow languages. Instead of each of the six execution engines mentioned earlier writing five different parsers (30 parsers in total), is there a better way to do it?

My thought is that it would greatly reduce the effort if we could come up with a new common schema describing a concrete job execution plan that can be compiled from a workflow defined in any of the existing workflow languages. With this approach, each workflow language would only need one converter to translate its own execution plan into the new common schema. Then, if an execution engine would like to support workflows written in other languages, it just needs to implement the capability to execute the 'common schema'.
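
Purely to make the idea concrete, a compiled execution plan might flatten a workflow, whatever language it was written in, into a DAG of containerized steps; all of the field names below are made up purely for illustration.

# Purely illustrative shape for a "common execution plan": a workflow compiles down to a
# DAG of concrete containerized steps. Every field name here is invented for discussion.
execution_plan = {
    "plan_version": "0.1-draft",
    "jobs": [
        {
            "id": "align",
            "image": "quay.io/biocontainers/bwa:0.7.17",
            "command": ["bwa", "mem", "ref.fa", "reads.fq"],
            "inputs": {"ref.fa": "gs://bucket/ref.fa", "reads.fq": "gs://bucket/reads.fq"},
            "outputs": {"aligned.sam": "stdout"},
            "depends_on": [],
        },
        {
            "id": "sort",
            "image": "quay.io/biocontainers/samtools:1.9",
            "command": ["samtools", "sort", "-o", "aligned.bam", "aligned.sam"],
            "inputs": {"aligned.sam": "align/aligned.sam"},
            "outputs": {"aligned.bam": "aligned.bam"},
            "depends_on": ["align"],
        },
    ],
}

# An engine that understands this plan could run CWL, WDL, Nextflow, etc. workflows once
# a per-language compiler to this format exists.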

Is this feasible? We don't necessarily need to cover all workflow languages; being able to support the two most popular, CWL and WDL, should be good enough and is probably a good starting point.

Chatted with @geoffjentry this afternoon; he seems to agree with this approach.

Thoughts?
