
wfexs-backend's Introduction

WfExS-backend: Workflow Execution Service backend

WfExS (which could be pronounced like "why-fex", "why-fix" or "why-fixes") is a project that aims to automate the following steps:

  • Fetch and cache a workflow from either:
    • A TRSv2-enabled WorkflowHub instance (which provides RO-Crates).
    • A TRSv2 (2.0.0-beta2 or 2.0.0) enabled service. Currently tested with Dockstore.
    • A straight URL to an existing RO-Crate in ZIP archive describing a workflow.
    • A git repository (using this syntax for the URI)
    • A public GitHub URL (like this example).
  • Identify the kind of workflow.
  • Fetch and set up the workflow execution engine (currently Nextflow and cwltool are supported).
  • Identify the containers needed by the workflow, and fetch/cache them. Depending on the local setup, singularity, apptainer, docker, podman or none of them will be used.
  • Fetch and cache the inputs, represented either through a URL or a CURIE-represented PID (public persistent identifier).
  • Execute the workflow in a secure way, if requested.
  • Optionally describe the results through an RO-Crate, and upload both the RO-Crate and the results elsewhere in a secure way.

Relevant docs:

  • INSTALL.md: In order to use WfExS-backend you first have to install at least the core dependencies described there.

  • TODO.md: This development is relevant for projects like EOSC-Life or EJP-RD. The list of high-level scheduled and pending developments can be seen there.

  • README_LIFECYCLE.md: WfExS-backend analysis lifecycle and usage scenarios are briefly described with flowcharts there.

  • README_REPLICATOR.md: It briefly describes WfExS-config-replicator.py usage.

Additional present and future documentation is hosted in the development-docs subfolder, until it is migrated to a proper documentation service.

Presentations and outreach

Fernández JM, Rodríguez-Navas L and Capella-Gutiérrez S. Secured and annotated execution of workflows with WfExS-backend [version 1; not peer reviewed]. F1000Research 2022, 11:1318 (poster) (https://doi.org/10.7490/f1000research.1119198.1)

Laura Rodríguez-Navas (2021): WfExS: a software component to enable the use of RO-Crate in the EOSC-Life collaboratory.
FAIR Digital Object Forum, CWFR & FDO SEM meeting, 2021-07-02 [video recording], [slides]

Laura Rodríguez-Navas (2021):
WfExS: a software component to enable the use of RO-Crate in the EOSC-Life tools collaboratory.
EOSC Symposium 2021, 2021-06-17 [video recording] [slides]

Salvador Capella-Gutierrez (2021):
Demonstrator 7: Accessing human sensitive data from analytical workflows available to everyone in EOSC-Life
Populating EOSC-Life: Success stories from the demonstrators, 2021-01-19. https://www.eosc-life.eu/d7/ [video] [slides]

Bietrix, Florence; Carazo, José Maria; Capella-Gutierrez, Salvador; Coppens, Frederik; Chiusano, Maria Luisa; David, Romain; Fernandez, Jose Maria; Fratelli, Maddalena; Heriche, Jean-Karim; Goble, Carole; Gribbon, Philip; Holub, Petr; P. Joosten, Robbie; Leo, Simone; Owen, Stuart; Parkinson, Helen; Pieruschka, Roland; Pireddu, Luca; Porcu, Luca; Raess, Michael; Rodriguez-Navas, Laura; Scherer, Andreas; Soiland-Reyes, Stian; Tang, Jing (2021):
EOSC-Life Methodology framework to enhance reproducibility within EOSC-Life.
EOSC-Life deliverable D8.1, Zenodo https://doi.org/10.5281/zenodo.4705078

WfExS-backend Usage

An automatically generated description of the command-line directives is available in the CLI section of the documentation.

A description of the different WfExS commands is also available in the command line section of the documentation.

Configuration files

The program uses three different types of configuration files:

  • Local configuration file: a YAML-formatted file which describes the local setup of the backend (example at workflow_examples/local_config.yaml). The JSON Schema describing the format (and used for validation) is available at wfexs_backend/schemas/config.json, and there is also automatically generated documentation (see config_schema.md). Relative paths in this configuration file are resolved against the directory where the local configuration file lives.

    • cacheDir: The path in this key sets up the place where all the cacheable contents are held: downloaded RO-Crates, downloaded workflow git repositories and downloaded workflow engines. It is recommended to keep it outside the /tmp directory when Singularity is being used, due to undesirable interactions with the way workflow engines use Singularity.

    • workDir: The path in this key sets up the place where all the executions store both intermediate and final results, with a separate directory for each execution. It is recommended to keep it outside the /tmp directory when Singularity is being used, due to undesirable interactions with the way workflow engines use Singularity.

    • crypt4gh.key: The path to the secret key used in this installation. It is paired to crypt4gh.pub.

    • crypt4gh.pub: The path to the public key used in this installation. It is paired to crypt4gh.key.

    • crypt4gh.passphrase: The passphrase needed to decrypt the contents of crypt4gh.key.

    • tools.engineMode: Currently, local mode only.

    • tools.containerType: Currently, singularity, docker or podman.

    • tools.gitCommand: Path to git command (only used when needed)

    • tools.dockerCommand: Path to docker command (only used when needed)

    • tools.singularityCommand: Path to singularity command (only used when needed)

    • tools.podmanCommand: Path to podman command (only used when needed)

    • tools.javaCommand: Path to java command (only used when needed)

    • tools.encrypted_fs.type: Kind of FUSE encryption filesystem to use for secure working directories. Currently, both gocryptfs and encfs are supported.

    • tools.encrypted_fs.command: Command path to be used to mount the secure working directory. The default depends on the value of tools.encrypted_fs.type.

    • tools.encrypted_fs.fusermount_command: Command to be used to unmount the secure working directory. Defaults to fusermount.

    • tools.encrypted_fs.idle: Number of minutes of inactivity before the encrypted FUSE filesystem is automatically unmounted. The default is 5 minutes.

  • Workflow configuration file: a YAML-formatted file which describes the workflow staging before execution: where the inputs are located and can be fetched, the security contexts to be used on specific inputs to obtain controlled-access resources, the parameters, the outputs to capture, etc. (Nextflow example, CWL example). The JSON Schema describing the format and valid keys (and used for validation) is available at wfexs_backend/schemas/stage-definition.json, and there is also automatically generated documentation (see stage-definition_schema.md).

  • Security contexts file: a YAML-formatted file which holds the user/password pairs, security tokens or keys needed in different steps, like input fetching (Nextflow example, CWL example). The JSON Schema describing the format and valid keys (and used for validation) is available at wfexs_backend/schemas/security-context.json, and there is also automatically generated documentation (see security-context_schema.md).

License

  • © 2020-2024 Barcelona Supercomputing Center (BSC), ES

Licensed under the Apache License, version 2.0 https://www.apache.org/licenses/LICENSE-2.0, see the file LICENSE for details.


wfexs-backend's Issues

Add support to `ga4ghdos` CURIE

The Data Object Service standard allows using a common identifier to locate resources which are replicated among several cloud services, as described at https://registry.identifiers.org/registry/ga4ghdos . For instance, ga4ghdos:dg.4503/01b048d0-e128-4cb0-94e9-b2d2cab7563d can be queried as

https://dataguids.org/ga4gh/dos/v1/dataobjects/dg.4503/01b048d0-e128-4cb0-94e9-b2d2cab7563d

In the obtained JSON, the urls section contains the links to the different replicas of the dataset, which could be FTP, HTTP(S), S3 or Google Cloud URIs.
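A minimal sketch of the resolution step, assuming the dataguids.org endpoint shown above is the resolver (other DOS servers exist, and the function name is illustrative):

```python
def ga4ghdos_to_url(
    curie: str,
    base: str = "https://dataguids.org/ga4gh/dos/v1/dataobjects/",
) -> str:
    """Map a ga4ghdos CURIE to a Data Object Service query URL.

    The base endpoint is taken from the example above; it is an
    assumption that it serves every ga4ghdos identifier.
    """
    scheme, _, identifier = curie.partition(":")
    if scheme != "ga4ghdos" or not identifier:
        raise ValueError(f"Not a ga4ghdos CURIE: {curie!r}")
    return base + identifier
```

Fetching that URL and reading the urls section of the returned JSON would then yield the replica links.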

Add support for `insdc.sra` CURIE

Many public projects, like 1000genomes, publish their genomes in the SRA repository, which is mirrored at NCBI, EBI and DDBJ. The idea is to add support for the insdc.sra compact URI scheme, providing all the download links based on the different mirrors.

Add support for swh permanent identifiers

Software Heritage swh permanent identifiers, described at https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html#interoperability , should be supported by WfExS-backend, as they can be used in two different ways.

First, there are repositories there which could contain workflows, so a method to fetch those workflows should be implemented.

Second, they provide a standardized way to compute a stable identifier for directories. Although there is an available implementation at https://pypi.org/project/swh.model/ , due to a licence conflict (it is GPLv3) a reimplementation of the algorithm is needed.
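The content (file) part of the scheme is small enough to sketch: a swh:1:cnt identifier is the sha1 of a git-style blob header followed by the raw bytes. This is a clean-room illustration of that rule only; directory identifiers (swh:1:dir) need the full manifest algorithm described in the linked docs.

```python
import hashlib

def swhid_for_content(data: bytes) -> str:
    """Compute the SWHID for a file's content (swh:1:cnt:<sha1>).

    Content identifiers follow the git blob convention: the sha1 of
    b"blob <length>\\0" followed by the raw bytes.
    """
    header = b"blob %d\x00" % len(data)
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()
```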

Error while running WfExS using a local workflow file/directory

Description

I am running WfExS with the config files shown below. When I run WfExS-backend.py -L local-config.yml stage -W test-stage.yml I get the following error: NotADirectoryError: [Errno 20] Not a directory: '/root/wfexs-backend-test_WorkDir/47761fdd-f06f-4260-a1f3-7351265805b3/workflow'.

Looking at the path in the error message, it seems workflow is the file given in workflow_id in the stage file; however, WfExS expects a directory there. I also tried putting a path to a directory in the workflow_id field, but that failed, saying it couldn't work out which runner to use.

Traceback

Traceback (most recent call last):
  File "/root/WfExS-backend/WfExS-backend.py", line 21, in <module>
    main()
  File "/root/WfExS-backend/wfexs_backend/main.py", line 1122, in main
    stagedSetup = wfInstance.stageWorkDir()
  File "/root/WfExS-backend/wfexs_backend/workflow.py", line 1985, in stageWorkDir
    self.materializeWorkflowAndContainers(offline=offline, ignoreCache=ignoreCache)
  File "/root/WfExS-backend/wfexs_backend/workflow.py", line 1233, in materializeWorkflowAndContainers
    self.setupEngine(offline=offline, ignoreCache=ignoreCache)
  File "/root/WfExS-backend/wfexs_backend/workflow.py", line 1191, in setupEngine
    self.fetchWorkflow(
  File "/root/WfExS-backend/wfexs_backend/workflow.py", line 1152, in fetchWorkflow
    engineVer, candidateLocalWorkflow = engine.identifyWorkflow(
  File "/root/WfExS-backend/wfexs_backend/cwl_engine.py", line 316, in identifyWorkflow
    newLocalWf = self._enrichWorkflowDeps(newLocalWf, engineVer)
  File "/root/WfExS-backend/wfexs_backend/cwl_engine.py", line 542, in _enrichWorkflowDeps
    with subprocess.Popen(
  File "/usr/lib/python3.10/subprocess.py", line 969, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/lib/python3.10/subprocess.py", line 1845, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
NotADirectoryError: [Errno 20] Not a directory: '/root/wfexs-backend-test_WorkDir/47761fdd-f06f-4260-a1f3-7351265805b3/workflow'

Settings

Stage file

# test-stage.yml
workflow_id: file:///root/hutch/workflows/sec-hutchx86.cwl
workflow_config:
  container: 'docker'
  secure: false
nickname: 'vas-workflow'
cacheDir: /tmp/wfexszn6siq2jtmpcache
crypt4gh:
  key: cosifer_test1_cwl.wfex.stage.key
  passphrase: mpel nite ified g
  pub: cosifer_test1_cwl.wfex.stage.pub
outputs:
  output_file:
    c-l-a-s-s: File
    glob: "output.json"
params:
  body:
    c-l-a-s-s: File
    url:
      - https://raw.githubusercontent.com/HDRUK/hutch/main/workflows/inputs/rquest-query.json
  is_availability: true
  db_host: "localhost"
  db_name: "hutch"
  db_user: "postgres"
  db_password: "example"

Local config

# local-config.yml
cacheDir: ./wfexs-backend-test
crypt4gh:
  key: local_config.yaml.key
  passphrase: strive backyard dividing gumball
  pub: local_config.yaml.pub
tools:
  containerType: docker
  dockerCommand: docker
  encrypted_fs:
    command: encfs
    type: encfs
  engineMode: local
  gitCommand: git
  javaCommand: java
  singularityCommand: singularity
  staticBashCommand: bash-linux-x86_64
workDir: ./wfexs-backend-test_WorkDir

Warn about `scrypt` crypt4gh keys

The crypt4gh library can generate and use keys based on different algorithms. One of them is scrypt, which depends on very specific features of the OpenSSL used to compile the Python interpreter.

https://github.com/EGA-archive/crypt4gh/blob/2ba98a7cea96e8fb337b17310cc1a226ad3b3e65/crypt4gh/keys/kdf.py#L29-L43

As the availability of this algorithm depends heavily on the OpenSSL version, WfExS-backend should:

  1. Emit a warning whenever the failure conditions are met: OpenSSL < 1.1.0 and a key generated with scrypt.
  2. Always generate new keys with a different algorithm, like bcrypt, which is less sensitive to the OpenSSL version the Python interpreter was compiled against.
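A sketch of the warning in point 1, assuming the relevant OpenSSL version is the one the interpreter's ssl module reports (the function name is hypothetical):

```python
import ssl
import warnings

def warn_if_fragile_scrypt(kdf_name: str) -> bool:
    """Warn when a crypt4gh key uses scrypt and the interpreter was
    compiled against OpenSSL older than 1.1.0; returns True when the
    warning fired."""
    if kdf_name == "scrypt" and ssl.OPENSSL_VERSION_INFO[:3] < (1, 1, 0):
        warnings.warn(
            "Key uses the scrypt KDF but this interpreter's OpenSSL "
            f"({ssl.OPENSSL_VERSION}) predates 1.1.0; decryption may "
            "fail. Consider regenerating the key with bcrypt."
        )
        return True
    return False
```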

WfExS-backend init issues

WfExS-backend init should create valid YAML configuration files when the --cache-dir parameter is provided. It should also validate already existing configuration files against the corresponding JSON Schema.

An example of the bad behaviour:

(.pyWEenv) jmfernandez@pavonis[14]:~/projects/WfExS-backend> python WfExS-backend.py --cache-dir /tmp/gorrito -L prueba2.yaml init
[WARNING] Configuration file prueba2.yaml does not exist
[WARNING] Cache directory not defined. Created a temporary one at /tmp/wfexsrkoltayctmpcache
2024-01-31 10:54:02,182 - [WARNING] [WARNING] Installation key file /home/jmfernandez/projects/WfExS-backend/prueba2.yaml.key does not exist
2024-01-31 10:54:02,182 - [WARNING] [WARNING] Installation pub file /home/jmfernandez/projects/WfExS-backend/prueba2.yaml.pub does not exist
* Storing updated configuration at prueba2.yaml
(.pyWEenv) jmfernandez@pavonis[15]:~/projects/WfExS-backend> cat prueba2.yaml
cache-directory: /tmp/gorrito
cacheDir: /tmp/wfexsrkoltayctmpcache
crypt4gh:
  key: prueba2.yaml.key
  passphrase: ndcart ndredth ndline elling
  pub: prueba2.yaml.pub

Cannot download content from ftp

Dear WfExS-Team,
I was testing WfExS on my local WSL2/Ubuntu.
Setting up the core and further dependencies in a conda environment worked without any trouble.
However, while running the test workflow
python3 WfExS-backend.py execute -W tests/wetlab2variations_execution_nxf_secure.wfex.stage
I got the following error:


[ERROR] Cannot download content from ftp://ftp-trace.ncbi.nih.gov/giab/ftp/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/140407_D00360_0017_BH947YADXX/Project_RM8398/Sample_U5c/U5c_CCGTCC_L001_R1_001.fastq.gz to 42be63ef9b0fc7d80d09513bfd3fa42b2288fd9b (while processing LicensedURI(uri='ftp://ftp-trace.ncbi.nih.gov/giab/ftp/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/140407_D00360_0017_BH947YADXX/Project_RM8398/Sample_U5c/U5c_CCGTCC_L001_R1_001.fastq.gz', licences=('https://choosealicense.com/no-permission/',), attributions=[], secContext=None)) (temp file /tmp/wfexsivum2b3rtmpcache/wf-inputs/caching-5f6ef9b7-b9b8-4f40-b38e-9ac854ef5ec3): can only concatenate str (not "NoneType") to str
Traceback (most recent call last):
  File "WfExS-backend.py", line 445, in <module>
    main()
  File "WfExS-backend.py", line 429, in main
    wfInstance.stageWorkDir()
  File "/home/valentin/wfexs/WfExS-backend/wfexs_backend/workflow.py", line 1027, in stageWorkDir
    self.materializeInputs()
  File "/home/valentin/wfexs/WfExS-backend/wfexs_backend/workflow.py", line 809, in materializeInputs
    theParams, numInputs = self.fetchInputs(
  File "/home/valentin/wfexs/WfExS-backend/wfexs_backend/workflow.py", line 1008, in fetchInputs
    newInputsAndParams, lastInput = self.fetchInputs(inputs,
  File "/home/valentin/wfexs/WfExS-backend/wfexs_backend/workflow.py", line 932, in fetchInputs
    matContent = self.wfexs.downloadContent(
  File "/home/valentin/wfexs/WfExS-backend/wfexs_backend/wfexs_backend.py", line 980, in downloadContent
    inputKind, cachedFilename, metadata_array, cachedLicences = self.cacheHandler.fetch(remote_file, workflowInputs_destdir, offline, ignoreCache, registerInCache, secContext)
  File "/home/valentin/wfexs/WfExS-backend/wfexs_backend/cache_handler.py", line 549, in fetch
    raise CacheHandlerException(errmsg) from nested_exception
wfexs_backend.cache_handler.CacheHandlerException: Cannot download content from ftp://ftp-trace.ncbi.nih.gov/giab/ftp/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/140407_D00360_0017_BH947YADXX/Project_RM8398/Sample_U5c/U5c_CCGTCC_L001_R1_001.fastq.gz to 42be63ef9b0fc7d80d09513bfd3fa42b2288fd9b (while processing LicensedURI(uri='ftp://ftp-trace.ncbi.nih.gov/giab/ftp/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/140407_D00360_0017_BH947YADXX/Project_RM8398/Sample_U5c/U5c_CCGTCC_L001_R1_001.fastq.gz', licences=('https://choosealicense.com/no-permission/',), attributions=[], secContext=None)) (temp file /tmp/wfexsivum2b3rtmpcache/wf-inputs/caching-5f6ef9b7-b9b8-4f40-b38e-9ac854ef5ec3): can only concatenate str (not "NoneType") to str

No VPN was active, nor anything else that could have prevented the FASTQ from downloading.

wget ftp://ftp-trace.ncbi.nih.gov/giab/ftp/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/140407_D00360_0017_BH947YADXX/Project_RM8398/Sample_U5c/U5c_CCGTCC_L001_R1_001.fastq.gz
was working, though.
Do you have any ideas on how to solve it?

Allow using as a staging source a Workflow Run RO-Crate

The target here is that WfExS-backend should be able to consume its own RO-Crates, demonstrating true reproducibility.

This feature is divided into two milestones:

  • Being able to reuse as much metadata as possible, so inputs, commits and containers are reused.
  • Being able to reuse RO-Crate bundled copies of workflow, inputs and containers in the instantiation.

This last one can bring issues related to Docker containers, as it might imply reassigning local container tags.

Bug in path resolution in local config file

Description

When running the following command: WfExS-backend/WfExS-backend.py -L local-config.yml execute -W test-stage.yml I got the following error message:

schema_salad.exceptions.ValidationException: Not found: '/root//root/wfexs-backend-test_WorkDir/efb98299-cb1f-48f8-862e-7a8746bba1a4/workflow/workflows/sec-hutchx86.cwl'

The path resolution appears to have added an additional /root/ to the front of the path in local-config.yml (see below). When I changed the workDir to ./wfexs-backend-test_WorkDir, the execution appeared to proceed as expected and I saw this in the logging output:

materialized workflow repository (checkout 6d500ca1396283faae2ce5eebf778500dd8be2da): /root/wfexs-backend-test_WorkDir/f51c9984-8e43-49fa-a03b-8e683e884980/workflow

The path resolves as would be expected if I ran WfExS from /root.
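The documented rule is that relative paths in the local configuration are resolved against the directory where the configuration file lives; a sketch of that resolution, with environment variables expanded first (the helper name is hypothetical, not WfExS code):

```python
import os

def resolve_config_path(path: str, config_file: str) -> str:
    """Resolve a path taken from a local configuration file.

    $HOME and friends are not expanded by YAML itself, so expand them
    here; a still-relative path is then anchored at the directory of
    the configuration file, never at the current working directory.
    """
    expanded = os.path.expandvars(os.path.expanduser(path))
    if os.path.isabs(expanded):
        return os.path.normpath(expanded)
    config_dir = os.path.dirname(os.path.abspath(config_file))
    return os.path.normpath(os.path.join(config_dir, expanded))
```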

Local config file

cacheDir: $HOME/wfexs-backend-test
crypt4gh:
  key: local_config.yaml.key
  passphrase: strive backyard dividing gumball
  pub: local_config.yaml.pub
tools:
  containerType: podman
  dockerCommand: docker
  podmanCommand: podman
  encrypted_fs:
    command: encfs
    type: encfs
  engineMode: local
  gitCommand: git
  javaCommand: java
  singularityCommand: singularity
  staticBashCommand: bash-linux-x86_64
workDir: $HOME/wfexs-backend-test_WorkDir

Stage file

workflow_id: https://raw.githubusercontent.com/HDRUK/hutch/main/workflows/sec-hutchx86.cwl
workflow_config:
  container: 'podman'
  secure: false
nickname: 'vas-workflow'
cacheDir: /tmp/wfexszn6siq2jtmpcache
crypt4gh:
  key: cosifer_test1_cwl.wfex.stage.key
  passphrase: mpel nite ified g
  pub: cosifer_test1_cwl.wfex.stage.pub
outputs:
  output_file:
    c-l-a-s-s: File
    glob: "output.json"
params:
  body:
    c-l-a-s-s: File
    url:
      - https://raw.githubusercontent.com/HDRUK/hutch/main/workflows/inputs/rquest-query.json
  is_availability: true
  db_host: "localhost"
  db_name: "hutch"
  db_user: "postgres"
  db_password: "example"

Add several checks in the code to detect containers unavailable for the current hardware architecture

Thanks to the tests from @dcl10, some issues have been uncovered related to workflows which depend on container images that are not available for the current processor architecture.

A way to reproduce the chain of issues is trying to execute the cosifer workflow, which depends on a single container prepared for the x86_64 / amd64 architecture, on a different architecture like linux arm64.

The cosifer "toy" workflow uses a single custom container which is only available for x86_64. WfExS-backend tries to materialize the container by itself, most probably doing it wrongly despite the architecture mismatch, but it should have complained before even trying to run cwltool. So, when cwltool tries running it, it surely fails, either because the previously materialized container is for the wrong architecture or because cwltool is not able to fetch any container suitable for the task. cwltool then returns an empty description of its outputs, which is deserialized to None instead of a dictionary, and the code fails trying to access the key "class" because None is not a dictionary.

Also, the caching directory should have a container images directory per supported architecture, so it can hold cached versions for both x86_64 and arm64, in case the caching directory is used in a heterogeneous HPC environment.
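The pre-flight complaint suggested above could start from something as simple as comparing the host architecture against what the image manifest advertises (a sketch only; a real check has to query the registry for the manifest's platform list):

```python
import platform

# uname machine names mapped to container platform names
_ARCH_ALIASES = {"x86_64": "amd64", "aarch64": "arm64"}

def host_container_arch() -> str:
    """Host architecture expressed in container-platform terms."""
    machine = platform.machine()
    return _ARCH_ALIASES.get(machine, machine)

def image_matches_host(image_archs) -> bool:
    """True when any architecture advertised for the image matches the
    host, so a mismatch can be reported before the engine tries (and
    cryptically fails) to run the container."""
    return host_container_arch() in set(image_archs)
```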

TypeError: Multiple inheritance with NamedTuple is not supported

Hi!

On Ubuntu LTS, with miniconda and Python 3.9.13, I cannot run WfExS. It fails with the following traceback:

(venv) kinow@ranma:~/Development/python/workspace/WfExS-backend$ python WfExS-backend.py --full-help
Traceback (most recent call last):
  File "/home/kinow/Development/python/workspace/WfExS-backend/WfExS-backend.py", line 39, in <module>
    from wfexs_backend.wfexs_backend import WfExSBackend
  File "/home/kinow/Development/python/workspace/WfExS-backend/wfexs_backend/wfexs_backend.py", line 57, in <module>
    from .common import AbstractWfExSException
  File "/home/kinow/Development/python/workspace/WfExS-backend/wfexs_backend/common.py", line 288, in <module>
    class GeneratedContent(AbstractGeneratedContent, NamedTuple):
  File "/home/kinow/Development/python/miniconda3/lib/python3.9/typing.py", line 1929, in _namedtuple_mro_entries
    raise TypeError("Multiple inheritance with NamedTuple is not supported")
TypeError: Multiple inheritance with NamedTuple is not supported

It looks like this could be related to the following issue:

I think it was first released with 3.9.0-alpha6. Given this is a change in Python, I guess WfExS will have to update the code eventually to support Py 3.9+. This patch fixes the initial command, but I am not sure whether it breaks something else 👍

diff --git a/wfexs_backend/common.py b/wfexs_backend/common.py
index 56878fd..a51dd7f 100644
--- a/wfexs_backend/common.py
+++ b/wfexs_backend/common.py
@@ -285,7 +285,7 @@ class ExpectedOutput(NamedTuple):
 class AbstractGeneratedContent(object):
     pass
 
-class GeneratedContent(AbstractGeneratedContent, NamedTuple):
+class GeneratedContent(AbstractGeneratedContent):
     """
     local: Local absolute path of the content which was generated. It
       is an absolute path in the outputs directory of the execution.
@@ -302,7 +302,7 @@ class GeneratedContent(AbstractGeneratedContent, NamedTuple):
     secondaryFiles: Optional[Sequence[AbstractGeneratedContent]] = None
 
 
-class GeneratedDirectoryContent(AbstractGeneratedContent, NamedTuple):
+class GeneratedDirectoryContent(AbstractGeneratedContent):
     """
     local: Local absolute path of the content which was generated. It
       is an absolute path in the outputs directory of the execution.

cwl_engine management of arrays of inputs

There is an issue with the workflow https://raw.githubusercontent.com/kids-first/kf-alignment-workflow/v2.7.3/workflows/kfdrc_alignment_wf.cwl , leading to the error message

inputdeclarations.yaml:2:1:  * the `input_bam_list` field is not valid because value is a CommentedMap, expected null or array of <File>

due to input_bam_list not being properly represented.

cwl_engine.CWLWorkflowEngine generates the file inputdeclarations.yaml before calling cwltool, in order to tell it the input parameters and where to find the files.

That YAML is created by createYAMLFile:

def createYAMLFile(self, matInputs, cwlInputs, filename):
    """
    Method to create a YAML file that describes the execution inputs of the workflow
    needed for their execution. Return parsed inputs.
    """
    try:
        execInputs = self.executionInputs(matInputs, cwlInputs)
        if len(execInputs) != 0:
            with open(filename, mode="w+", encoding="utf-8") as yaml_file:
                yaml.dump(execInputs, yaml_file, allow_unicode=True, default_flow_style=False, sort_keys=False)
            return execInputs
        else:
            raise WorkflowEngineException(
                "Dict of execution inputs is empty")
    except IOError as error:
        raise WorkflowEngineException(
            "ERROR: cannot create YAML file {}, {}".format(filename, error))

which depends on the output of executionInputs:

def executionInputs(self, matInputs: List[MaterializedInput], cwlInputs):
    """
    Setting execution inputs needed to execute the workflow
    """
    if len(matInputs) == 0:  # Is list of materialized inputs empty?
        raise WorkflowEngineException("FATAL ERROR: Execution with no inputs")
    if len(cwlInputs) == 0:  # Is list of declared inputs empty?
        raise WorkflowEngineException("FATAL ERROR: Workflow with no declared inputs")
    execInputs = dict()
    for matInput in matInputs:
        if isinstance(matInput, MaterializedInput):  # input is a MaterializedInput
            # numberOfInputs = len(matInput.values)  # number of inputs inside a MaterializedInput
            for input_value in matInput.values:
                name = matInput.name
                value_type = cwlInputs.get(name, {}).get('type')
                if value_type is None:
                    raise WorkflowEngineException("ERROR: input {} not available in workflow".format(name))
                value = input_value
                if isinstance(value, MaterializedContent):  # value of an input contains MaterializedContent
                    if value.kind in (ContentKind.Directory, ContentKind.File):
                        if not os.path.exists(value.local):
                            self.logger.warning("Input {} is not materialized".format(name))
                        value_local = value.local
                        if isinstance(value_type, dict):  # MaterializedContent is a List of File
                            classType = value_type['items']
                            execInputs.setdefault(name, []).append({"class": classType, "location": value_local})
                        else:  # MaterializedContent is a File
                            classType = value_type
                            execInputs[name] = {"class": classType, "location": value_local}
                    else:
                        raise WorkflowEngineException(
                            "ERROR: Input {} has values of type {} this code does not know how to handle".format(name, value.kind))
                else:
                    execInputs[name] = value
    return execInputs
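For reference, cwltool expects an array-of-File parameter such as input_bam_list to be serialized as a YAML sequence of File entries; the error above indicates a single mapping (a CommentedMap) was emitted instead. An illustrative shape (the paths are placeholders):

```yaml
# Expected in inputdeclarations.yaml for an array-of-File input:
input_bam_list:
  - class: File
    location: /path/to/sample1.bam
  - class: File
    location: /path/to/sample2.bam

# The failing shape, a single mapping instead of a sequence:
# input_bam_list:
#   class: File
#   location: /path/to/sample1.bam
```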

Use pyinvoke and Fabric

The following example shows how to issue commands which can be run either remotely or locally, https://stackoverflow.com/a/55704170 , based on both the pyinvoke and Fabric libraries.

Past the 1.0 milestone, WfExS-backend is going to gain different non-raw execution scenarios, like in-container runs, runs as different users, remote runs through ssh, remote runs through a queue system (first monolithic, later spread) and remote runs through GA4GH TES and WES.

A way to integrate this seamlessly is to first transition to both pyinvoke and Fabric, so local and ssh executions are handled uniformly, and then try to extend the approach to the other execution environments.

Can't execute workflows using podman

Description

Using stage I can stage a workflow with podman. However, running the workflow with staged-workdir offline-exec I get the following error:

ERROR Workflow error:
Docker is not available for this tool, try --no-container to disable Docker, or install a user space Docker replacement like uDocker with --user-space-docker-cmd.: Docker image hutchstack/rquest-omop-worker:next not found

Fiddling with the code on a fork, I found adding --no-container or --user-space-docker-cmd isn't compatible with --podman.

In cwl_engine.py I found that commenting out the --disable-pull line seemed to fix the problem, and the workflow runs as expected. However, I guess --disable-pull is there for a good reason. Could something be preventing WfExS from looking where the podman image is saved for the staged image?

Parsing of Nextflow DSL2 workflows

Right now, the Nextflow workflow source is parsed in order to learn the needed containers. The approach is not foolproof, as the container declaration can depend on variables, and in the case of DSL2 workflows the declarations can be spread over several files.

So, at a minimum, all the (sub)workflow files involved need to be parsed.
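A sketch of such a multi-file scan for static container directives (the regex and helper are illustrative; declarations built from variables would still escape it, which is exactly the weak spot described above):

```python
import re
from pathlib import Path

# Matches statically declared containers, e.g.
#   container 'quay.io/biocontainers/samtools:1.17'
_CONTAINER_RE = re.compile(r"^\s*container\s+['\"]([^'\"]+)['\"]", re.MULTILINE)

def scan_containers(workflow_files):
    """Collect container images declared across a DSL2 workflow and
    every included (sub)workflow/module file."""
    found = set()
    for wf in workflow_files:
        found.update(_CONTAINER_RE.findall(Path(wf).read_text(encoding="utf-8")))
    return found
```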

Add support for compact `drs` identifiers

As of https://ga4gh.github.io/data-repository-service-schemas/preview/release/drs-1.1.0/docs/#_appendix_compact_identifier_based_uris , drs URIs can be in compact form, which adds an additional level of indirection, resolving where the DRS server lives against either n2t.net or identifiers.org . The implementation added at 11d6873 does not consider this level of indirection, and is not able to tell whether it is dealing with a compact DRS URI or a hostname-based one.
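A sketch of the missing discrimination, based on the appendix's convention that the compact form carries a colon (namespace:accession) where the hostname form has none; it assumes no port numbers in hostname-based URIs, which this heuristic does not handle:

```python
def is_compact_drs(uri: str) -> bool:
    """Heuristically classify a drs:// URI.

    Compact identifier-based URIs look like
    drs://[provider_code/]namespace:accession and must be resolved
    through identifiers.org or n2t.net, while hostname-based ones
    look like drs://hostname/object_id.
    """
    if not uri.startswith("drs://"):
        raise ValueError(f"Not a drs URI: {uri!r}")
    return ":" in uri[len("drs://"):]
```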

Secrets/secret inputs

Background

My team would like to use WfExS in a Trusted Research Environment (TRE) which has data sources that can't be exposed to the outside world. We anticipate that the environment will contain variables which must be kept secret (i.e. not in the output RO-Crate). In some cases, some inputs may also be sensitive, and we would like them not to be included in the output RO-Crate either.

Proposed Feature

For secret environment variables, would it be possible to add a section in the local config yaml file where we could put the variables as key-value pairs and then have WfExS load these into the local environment at runtime? Then during the creation of the RO-Crate, check for the secrets and exclude them from the crate and its metadata?

For secret inputs, would it be possible to add to the definition of an input a boolean flag telling WfExS whether that input is secret? Then, similarly to the above, have it excluded from the crate and its metadata.
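A sketch of the redaction step on the RO-Crate side, assuming secret entries are flagged by name (the marker string and function name are hypothetical):

```python
def redact_secrets(params: dict, secret_names: set) -> dict:
    """Return a copy of a params/environment mapping with flagged
    entries replaced by a marker, so their values never reach the
    output RO-Crate or its metadata."""
    return {
        name: "<redacted>" if name in secret_names else value
        for name, value in params.items()
    }
```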

PermissionError: [Errno 13] Permission denied: '/home/ansible/wfexs-backend-test/wf-cache'

Description

When staging a workflow for the first time as a non-sudo/root user I'm getting the error in the title. Oddly, deleting the cache dir and re-running the stage command seems to fix the issue, as does adding write permissions with chown. I'm not sure if it has anything to do with the calls to os.makedirs or a umask issue.

Here's something I was reading about the problem. Not sure if it will be helpful. https://stackoverflow.com/questions/5231901/permission-problems-when-creating-a-dir-with-os-makedirs-in-python/67723702#67723702

Record the licence of the workflow in RO-Crate

When a workflow is fetched from a git repository or an RO-Crate pointing to a repository, the licence file of the workflow repository should be included in generated RO-Crates, in case it exists.

Add metadata related to fetched URIs

Right now WfExS does not keep a correspondence between URLs and downloaded files, as the filenames are hashes generated from the URL. But there are several scenarios where additional upstream metadata is available, and future cases where a single URL corresponds to a collection of files. As an example of the latter, an ENCODE experiment id or an EGA dataset id corresponds to more than one file, maybe each with its own independent download URL.

So, there should be an intermediate metadata layer where these correspondences and the upstream metadata are kept. After this change, cached files should be named after the sha256 of their content, and URIs should translate to JSON files named after the hash of the URI, containing the correspondences to the cached files and their origins.

Last, but not least, upstream metadata should be gathered and preserved in the execution provenance.
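The proposed layout could be sketched like this (every name and the JSON shape are illustrative, not the eventual WfExS format):

```python
import hashlib
import json
from pathlib import Path

def store_fetched(cache_dir: str, uri: str, content: bytes) -> Path:
    """Store content under the sha256 of its bytes, and record the
    URI-to-file correspondence in a JSON file named after the sha256
    of the URI; that JSON is where upstream metadata would accumulate."""
    cache = Path(cache_dir)
    content_hash = hashlib.sha256(content).hexdigest()
    (cache / content_hash).write_bytes(content)
    meta = cache / (hashlib.sha256(uri.encode("utf-8")).hexdigest() + ".json")
    meta.write_text(json.dumps({"uri": uri, "files": [content_hash]}))
    return meta
```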

`dot` dependency should be optional

Right now, when a prospective RO-Crate is generated, dot is used to translate the workflow representation generated by the workflow engine into a PNG. When the command is not available or not properly installed, the generation of the RO-Crate fails.

Publish a major release with DOI

Now that we have a CITATION.cff (as of #13), we have to publish a major release with a DOI generated by Zenodo, in order to add that DOI to CITATION.cff.

That major release should be triggered by a major event.

Add validation capabilities over fetched contents

Today I found a scenario where some content fetched from FTP was corrupted during the download process. There are several validation mechanisms which could be integrated into WfExS-backend:

  • When a file is a known compressed archive (tar, gz, bzip2, xz, zip), its integrity should be checked.
  • When a file is signed and a public signing key is available, check that the file was not tampered with.
  • Declaring a file to be fetched which contains MD5 or SHA1 sums or signatures of the fetched contents.
  • Declaring inline fields containing the MD5 or SHA1 sums of the fetched contents.
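The first bullet can be sketched with the checksums those formats already embed: zip members and gzip streams both carry CRC32, so fully reading them detects truncation or corruption (a sketch covering just those two formats):

```python
import gzip
import zipfile

def archive_intact(path: str) -> bool:
    """Verify the internal integrity of a zip or gzip download."""
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as zf:
            return zf.testzip() is None  # None means every member's CRC matched
    try:
        with gzip.open(path, "rb") as gf:
            while gf.read(1 << 20):  # decompress everything, checking the CRC
                pass
        return True
    except (OSError, EOFError):
        return False
```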
