The argo-data-submission from icgc-argo

Add check to verify existence of `matchedNormalSubmitterId`

On-fly Md5sum and filesize generation

To help out Xindi's file submission, we'll add new steps into pipeline to on-fly md5sum and filesize calculation for payload generation. These features will be hidden behind flags for developer use only and dependency so payload generation will trigger after ega-downloads

Updated pay-load-gen-seq to support CRAM replacement

Feature Request - add support to singularity

Investigation of adding `dryrun` option to workflow

Instead of providing test projects or sandbox to users, we can add dryrun option to the workflow, it will run the stub scripts to test all services are accessible.

Score parameters not being passed correctly

CPUs and memory are not being passed into score b/c the correct variables are not being set:
https://github.com/icgc-argo/argo-data-submission/blob/main/argo-data-submission-wf/main.nf#L129-L142

Possible Implementation

  'score_transport_mem' : params.mem,
  'score_mem' : params.mem,
  'score_cpus': params.cpus,
  'song_mem' : params.mem,
  'song_cpus': params.cpus,

Bump `validate-seqtools`

Bump validate-seq-tools version to use newest seq-tools as latest sequencing-experiment schema change was reorganized breaking seq-tools

Add step in sanity check to verify consistent studyID in params and experiment.tsv

Currently --study-id in nextflow command is only utilized in upload.

This value can be different than what is supplied in the JSON or TSV payloads.

Add a check to make sure the two are consistent.

Update data submission workflow to reflect the dictionary changes in `experiment`

Convert JSON Payloads to TSV for General Data Submission

@edsu7 - To add details

Feature Request - add step to generate receipt for successful submission

loose the output file type restriction in tool sanityCheck

In the tool sanity check, the expected output file updated_experiment_info_tsv must have suffix tsv, can we loose the file type restriction?
What if the user prepare their metadata with txt suffix?

🐛 `Decrypt-aspera` add renaming

Based on the following example:

type	name	format	path	ega_file_id	md5sum	size
sequencing_file	DTB-097-Progression-cfDNA_1.fq.gz	FASTQ	EGAD00001008460/PART_2/prod_ega-box-1043_EGAR00003306459_DTB-097-Progression-cfDNA_1.fq.gz.1649803262196.c4gh	EGAF00006144300	1fa88491833d65e1a4f7e8a08e458322	169568827826
sequencing_file	DTB-097-Progression-cfDNA_2.fq.gz	FASTQ	EGAD00001008460/PART_2/prod_ega-box-1043_EGAR00003306459_DTB-097-Progression-cfDNA_2.fq.gz.1649809921842.c4gh	EGAF00006144301	f42d0e69efcef37d8c7b883b6ddce3e1	191746184535

File prod_ega-box-1043_EGAR00003306459_DTB-097-Progression-cfDNA_1.fq.gz.1649803262196 is successfully downloaded and decrypted but will fail decrypt-aspera b/c suffix is not recognized.

Suggested fix : rename file to workout silly naming schemes.

Add non-committal run

Add option so that run goes through the whole pipeline but stops at uploading step.

missing column in `ega-download-wf` results in generic message

In ega-download-wf lines L93-L94, if columns ega_file_id or path cannot be found nextflow will output a generic error message.

Cannot invoke method size() on null object
 -- Check script 'GitHub/icgc-argo/argo-data-submission/argo-data-submission-wf/./wfpr_modules/github.com/icgc-argo/argo-data-submission/[email protected]/main.nf' at line: 108 or see '.nextflow.log' file for more details

User experience can be improved with a clearer message

Support for CRAM submission

Handle `validation.report.UNKNOWN.json` in seq-tools

Currently seq-tools will report step failed if validation.report.INVALId.json is detected. Otherwise will look for validation.report.valid.json

When the script errors out, can generate validation.report.UNKNOWN.json which is reported as a PASS in nextflow

Update molecular metadata dictionary and template

Since the proposed changes to experiment table have been reviewed and approved by DCMWG (May 31st, 2023), we will go ahead to update metadata dictionary and template under folder: https://github.com/icgc-argo/argo-data-submission/tree/main/metadata_dictionary

Adjust the default values of the workflow to avoid requesting from users

some default values for

song_url: https://submission-song.rdpc.cancercollaboratory.org
score_url: https://submission-score.rdpc.cancercollaboratory.org
cpus: 2
mem: 4
max_retries: 2

Add params `--force` to let the song-score-upload to overwrite the previously uploaded objects

Context:
Feature request from MUTO-INTL, there was a failed upload due to score-client version. After upgrade the score-client version, they will need to use score's --force params to overwrite the existing uploaded objects.

The submission workflow and song-score-upload subworkflow need to expose that param in order to support the feature.

Force mandatory usage of profiles

Investigate if profile usage can be forced to avoid users running the workflows without proper profiles

Update seq-tools to support nested CRAM info check

Given the example :

  "files": [
    {
      "fileName": "anon_chr1_completeA.bam",
      "fileSize": 269442,
      "fileMd5sum": "968711f781f217dfb1de630e520ccacb",
      "fileType": "BAM",
      "fileAccess": "controlled",
      "dataType": "Submitted Reads",
      "info": {
        "data_category": "Sequencing Reads",
        "original_cram_info": {
          "fileName": "anon_chr1_completeA.cram",
          "fileSize": 132507,
          "fileMd5sum": "0d1776d44d8da87758d6b159aa8e6bc5",
          "fileType": "CRAM",
          "referenceFileName": "hs73d5.fa.gz"
        }
      }
    }

Seq-tools currently only checks files[0][fileSize] and files[0][fileMd5sum].
For CRAM files that have their original info nested and converted version is auto-populates files[0][fileSize] and files[0][fileMd5sum] .
This potentially allows a submitter to submit incorrect CRAM info and bypass md5sum and fileSize checks.

Update Seq-tools
Release updated seq-tools version
Update Seq-tools WFPM module

Miss to catch the invalid value in schema validation

Although the field experiment/platform in submission-song.rdpc-qa sequencing_experiment schema has an enum definition

"platform": {
 "enum": [
  "CAPILLARY",
  "LS454",
  "ILLUMINA",
  "SOLID",
  "HELICOS",
  "IONTORRENT",
  "ONT",
  "PACBIO",
  "Nanopore",
  "BGI"
]

but all the submission payloads pass the schema validation when the field was provided as:

"plzatform": "Illumina",

Plus both submission workflow and the song server validations did not catch the errors.

Print the validation report from `seq-tools` in the end

Consider to add the information of the validation report generated from seq-tools in the final step, as that the users are aware of the validation results, like whether it is

PASS
PASS-with-WARNING
PASS-with-SKIPPED-check
PASS-with-WARNING-and-SKIPPED-check
So that they can decide whether further actions are needed.

Fix Docker - Mount Song/Score log folder to container work folder

Update pyega3 to latest version and add `--connections`

🐛 Sanity check failure

Describe the bug

Running the following argo-data submission workflow without duplicate check succeeds when normally should fail

nextflow run ../main.nf -params-file local-test-job-2-InputFastq.json -profile debug_qa,docker --api_token ${token}
N E X T F L O W  ~  version 22.04.0
Launching `../main.nf` [goofy_einstein] DSL2 - revision: 8a2de252d7
WARN: Nextflow version 22.04.0 does not match workflow required version: >=22.10.0 -- Execution will continue, but things may break!
executor >  local (10)
[26/9981f3] process > ArgoDataSubmissionWf:sanityCheck                                             [100%] 1 of 1 ✔
executor >  local (10)
[26/9981f3] process > ArgoDataSubmissionWf:sanityCheck                                             [100%] 1 of 1 ✔
[a4/7ca8b6] process > ArgoDataSubmissionWf:checkCramReference                                      [100%] 1 of 1 ✔
[c6/f5956f] process > ArgoDataSubmissionWf:pGenExp                                                 [100%] 1 of 1 ✔
[f6/6bff5c] process > ArgoDataSubmissionWf:valSeq                                                  [100%] 1 of 1 ✔
[89/245ea3] process > ArgoDataSubmissionWf:uploadWf:songSub (TEST-QA)                              [100%] 1 of 1 ✔
[9c/145e3c] process > ArgoDataSubmissionWf:uploadWf:songMan (5c43282e-3fb7-4877-8328-2e3fb7d8777e) [100%] 1 of 1 ✔
[d8/e3b7b5] process > ArgoDataSubmissionWf:uploadWf:scoreUp (5c43282e-3fb7-4877-8328-2e3fb7d8777e) [100%] 1 of 1 ✔
[a0/0ee02c] process > ArgoDataSubmissionWf:uploadWf:songPub (5c43282e-3fb7-4877-8328-2e3fb7d8777e) [100%] 1 of 1 ✔
[b0/59e24c] process > ArgoDataSubmissionWf:submissionReceipt                                       [100%] 1 of 1 ✔
[8d/52ecee] process > ArgoDataSubmissionWf:printOut                                                [100%] 1 of 1 ✔

Payload JSON File : /Users/esu/Desktop/GitHub/icgc-argo/argo-data-submission/argo-data-submission-wf/tests/work/c6/f5956f0d92d1e14daa5061be911a32/68211b84-d07b-4292-b52c-fe6311d79cb9.sequencing_experiment.payload.jsonAnalysis ID : 5c43282e-3fb7-4877-8328-2e3fb7d8777e
Submission TSV Receipt: /Users/esu/Desktop/GitHub/icgc-argo/argo-data-submission/argo-data-submission-wf/tests/work/b0/59e24c8b054a8284f0925c1fac3731/5c43282e-3fb7-4877-8328-2e3fb7d8777e_submission_receipt.tsv

This is b/c in sanity check:

argo-data-submission/sanity-check/main.py

Lines 188 to 201 in ed5dcdb

    
           for analysis in response.json(): 
        
               ### If analysis is suppressed; we ignore 
        
               if \ 
        
               analysis["analysisState"]=="PUBLISHED" and \ 
        
               analysis["experiment"]["experimental_strategy"]==metadata.get('experimental_strategy') and \ 
        
               analysis['analysisType']['name']=="sequencing_experiment"\ 
        
               : 
        
                   sys.exit( 
        
                       "Sample '%s'/'%s' has an existing published analysis '%s' for experiment_strategy '%s.'" 
        
                       % \ 
        
                       (metadata.get('submitter_sample_id'),metadata.get('sample_id'),analysis['analysisId'],metadata.get('experimental_strategy')) 
        
                   ) 
        
               else: 
        
                   return True

Evaluates on first instance of an analysis. In the event where a sample has multiple analyses, and the first analysis in the list will determine sanity pass/fail.

So for SA624380 that has a mixture of UNPUBLISHED and PUBLISHED analyses:

01d4e350-32e8-4731-94e3-5032e82731db UNPUBLISHED
026e7dbd-8a7b-4ee1-ae7d-bd8a7b0ee120 PUBLISHED
03d2714a-e860-48ad-9271-4ae860d8ad3d PUBLISHED
068c76dd-3806-4e0e-8c76-dd38064e0ea6 PUBLISHED
06e7cb7d-d5bd-4227-a7cb-7dd5bdc22740 UNPUBLISHED

B/C the first instance 01d4e350-32e8-4731-94e3-5032e82731db is UNPUBLISHED sanity check misses flagging this sample

https://github.com/icgc-argo/argo-data-submission/blob/main/nextflow.config

	for analysis in response.json():
	### If analysis is suppressed; we ignore
	if \
	analysis["analysisState"]=="PUBLISHED" and \
	analysis["experiment"]["experimental_strategy"]==metadata.get('experimental_strategy') and \
	analysis['analysisType']['name']=="sequencing_experiment"\
	:
	sys.exit(
	"Sample '%s'/'%s' has an existing published analysis '%s' for experiment_strategy '%s.'"
	% \
	(metadata.get('submitter_sample_id'),metadata.get('sample_id'),analysis['analysisId'],metadata.get('experimental_strategy'))
	)
	else:
	return True

icgc-argo / argo-data-submission Goto Github PK

argo-data-submission's People

Contributors

Watchers

argo-data-submission's Issues

Possible Implementation

Describe the bug

Recommend Projects

Recommend Topics

Recommend Org