icgc-argo / argo-data-submission
Workflow to submit genomic data to ARGO RDPC for processing
License: Other
To help out with Xindi's file submission, we'll add new steps to the pipeline for on-the-fly md5sum and file-size calculation during payload generation. These features will be hidden behind flags for developer use only, plus a dependency so that payload generation triggers after the ega-download steps.
Instead of providing test projects or a sandbox to users, we can add a dry-run option to the workflow; it will run the stub scripts to test that all services are accessible.
CPUs and memory are not being passed into score because the correct variables are not being set:
https://github.com/icgc-argo/argo-data-submission/blob/main/argo-data-submission-wf/main.nf#L129-L142

```
'score_transport_mem' : params.mem,
'score_mem' : params.mem,
'score_cpus': params.cpus,
'song_mem' : params.mem,
'song_cpus': params.cpus,
```
Bump the validate-seq-tools version to use the newest seq-tools, as the latest sequencing-experiment schema was reorganized, breaking seq-tools.
Currently --study-id in the nextflow command is only used during upload. This value can differ from what is supplied in the JSON or TSV payloads. Add a check to make sure the two are consistent.
@edsu7 - To add details
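A sketch of the proposed consistency check (the `studyId` payload key and function name are assumptions about where the value lives):

```python
def check_study_id(cli_study_id, payload):
    """Fail early when the --study-id given on the command line disagrees
    with the study id carried inside the payload."""
    payload_study = payload.get("studyId")
    if payload_study != cli_study_id:
        raise ValueError(
            f"--study-id '{cli_study_id}' does not match payload "
            f"studyId '{payload_study}'"
        )
    return True
```

Running this before upload would surface the mismatch at submission time instead of after SONG/SCORE interaction.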
In the tool sanity check, the expected output file updated_experiment_info_tsv must have the suffix tsv. Can we loosen the file-type restriction? What if the user prepares their metadata with a txt suffix?
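One way the restriction could be relaxed (the whitelist and helper name below are hypothetical):

```python
# Hypothetical relaxed whitelist: accept any tab-delimited text suffix,
# not only '.tsv'.
ACCEPTED_SUFFIXES = (".tsv", ".txt")

def is_accepted_metadata_file(filename):
    """Return True when the metadata file carries an accepted suffix."""
    return filename.lower().endswith(ACCEPTED_SUFFIXES)
```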
Based on the following example:
type | name | format | path | ega_file_id | md5sum | size |
---|---|---|---|---|---|---|
sequencing_file | DTB-097-Progression-cfDNA_1.fq.gz | FASTQ | EGAD00001008460/PART_2/prod_ega-box-1043_EGAR00003306459_DTB-097-Progression-cfDNA_1.fq.gz.1649803262196.c4gh | EGAF00006144300 | 1fa88491833d65e1a4f7e8a08e458322 | 169568827826 |
sequencing_file | DTB-097-Progression-cfDNA_2.fq.gz | FASTQ | EGAD00001008460/PART_2/prod_ega-box-1043_EGAR00003306459_DTB-097-Progression-cfDNA_2.fq.gz.1649809921842.c4gh | EGAF00006144301 | f42d0e69efcef37d8c7b883b6ddce3e1 | 191746184535 |
The file prod_ega-box-1043_EGAR00003306459_DTB-097-Progression-cfDNA_1.fq.gz.1649803262196 is successfully downloaded and decrypted but will fail decrypt-aspera because the suffix is not recognized.
Suggested fix: rename the file to work around silly naming schemes.
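A sketch of the suggested rename, assuming the unwanted suffix is always a trailing dot-separated epoch timestamp as in the example above:

```python
import re

def strip_trailing_timestamp(filename):
    """Drop a trailing '.<epoch-millis>' appended after the real
    extension, restoring a recognizable '.fq.gz' suffix."""
    # 10+ digits at the end distinguishes a timestamp from normal
    # numeric name parts like '_1'.
    return re.sub(r"\.\d{10,}$", "", filename)
```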
Add an option so that a run goes through the whole pipeline but stops at the uploading step.
In ega-download-wf lines L93-L94, if the columns ega_file_id or path cannot be found, nextflow will output a generic error message:

```
Cannot invoke method size() on null object
 -- Check script 'GitHub/icgc-argo/argo-data-submission/argo-data-submission-wf/./wfpr_modules/github.com/icgc-argo/argo-data-submission/[email protected]/main.nf' at line: 108 or see '.nextflow.log' file for more details
```

The user experience can be improved with a clearer message.
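A sketch of an explicit header check that could run before the channel logic (helper name is hypothetical):

```python
import csv
import io

REQUIRED_COLUMNS = ("ega_file_id", "path")

def read_file_table(tsv_text):
    """Parse the file table, failing with an explicit message when a
    required column is absent instead of a generic null-object error."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError("missing required column(s): " + ", ".join(missing))
    return list(reader)
```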
Currently seq-tools will report the step as failed if validation.report.INVALID.json is detected; otherwise it will look for validation.report.valid.json. When the script errors out, it can generate validation.report.UNKNOWN.json, which is reported as a PASS in nextflow.
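A minimal sketch (hypothetical helper names) of how the wrapper could classify the report files so that UNKNOWN fails the step instead of passing silently:

```python
def classify_validation(report_files):
    """Derive a step status from the seq-tools report filenames."""
    names = {f.lower() for f in report_files}
    if "validation.report.invalid.json" in names:
        return "INVALID"
    if "validation.report.unknown.json" in names:
        return "UNKNOWN"  # the script crashed; do not treat as valid
    if "validation.report.valid.json" in names:
        return "VALID"
    return "MISSING"

def step_passed(report_files):
    # Only an explicit VALID report should count as a pass.
    return classify_validation(report_files) == "VALID"
```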
Since the proposed changes to the experiment table have been reviewed and approved by the DCMWG (May 31st, 2023), we will go ahead and update the metadata dictionary and template under the folder: https://github.com/icgc-argo/argo-data-submission/tree/main/metadata_dictionary
some default values for
Context:
Feature request from MUTO-INTL: there was a failed upload due to the score-client version. After upgrading the score-client version, they will need to use score's --force param to overwrite the existing uploaded objects.
The submission workflow and the song-score-upload subworkflow need to expose that param in order to support the feature.
Investigate whether profile usage can be enforced to prevent users from running the workflows without proper profiles.
Given the example:

```
"files": [
  {
    "fileName": "anon_chr1_completeA.bam",
    "fileSize": 269442,
    "fileMd5sum": "968711f781f217dfb1de630e520ccacb",
    "fileType": "BAM",
    "fileAccess": "controlled",
    "dataType": "Submitted Reads",
    "info": {
      "data_category": "Sequencing Reads",
      "original_cram_info": {
        "fileName": "anon_chr1_completeA.cram",
        "fileSize": 132507,
        "fileMd5sum": "0d1776d44d8da87758d6b159aa8e6bc5",
        "fileType": "CRAM",
        "referenceFileName": "hs73d5.fa.gz"
      }
    }
  }
]
```
Seq-tools currently only checks files[0][fileSize] and files[0][fileMd5sum]. For CRAM files, the original file's info is nested under original_cram_info while the converted version auto-populates files[0][fileSize] and files[0][fileMd5sum]. This potentially allows a submitter to submit incorrect CRAM info and bypass the md5sum and fileSize checks.
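A sketch of a check that also validates the nested CRAM fields (helper name is hypothetical; a real check would additionally compare the checksums against an independent source):

```python
import re

MD5_RE = re.compile(r"^[0-9a-f]{32}$")

def check_file_entry(entry):
    """Validate fileSize/fileMd5sum on the top-level file entry AND on
    the nested original_cram_info, so incorrect CRAM info cannot slip
    past a check that only looks at files[0]."""
    problems = []
    targets = [("file", entry)]
    nested = entry.get("info", {}).get("original_cram_info")
    if nested:
        targets.append(("original_cram_info", nested))
    for label, obj in targets:
        size = obj.get("fileSize")
        if not isinstance(size, int) or size <= 0:
            problems.append(label + ": bad fileSize")
        if not MD5_RE.match(str(obj.get("fileMd5sum", ""))):
            problems.append(label + ": bad fileMd5sum")
    return problems
```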
Although the field experiment/platform in the submission-song.rdpc-qa sequencing_experiment schema has an enum definition:

```
"platform": {
  "enum": [
    "CAPILLARY",
    "LS454",
    "ILLUMINA",
    "SOLID",
    "HELICOS",
    "IONTORRENT",
    "ONT",
    "PACBIO",
    "Nanopore",
    "BGI"
  ]
}
```

all the submission payloads pass the schema validation when the field was provided as:

```
"plzatform": "Illumina",
```

Plus, both the submission workflow and the song server validations did not catch the error.
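This is the usual JSON Schema gap: unknown keys are ignored unless the field is listed in "required" and "additionalProperties" is false, so the misspelled key simply never gets validated against the enum. A sketch of an explicit check (helper name is hypothetical) that would catch both the misspelled key and the wrong casing:

```python
PLATFORM_ENUM = {"CAPILLARY", "LS454", "ILLUMINA", "SOLID", "HELICOS",
                 "IONTORRENT", "ONT", "PACBIO", "Nanopore", "BGI"}

def check_platform(experiment):
    """Require the 'platform' key explicitly and validate it against
    the enum, so a stray key like 'plzatform' fails loudly."""
    if "platform" not in experiment:
        raise ValueError("required field 'platform' is missing; "
                         "found keys: " + ", ".join(sorted(experiment)))
    if experiment["platform"] not in PLATFORM_ENUM:
        raise ValueError(
            "platform '%s' is not in the enum" % experiment["platform"])
    return True
```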
Consider adding the information from the validation report generated by seq-tools to the final step, so that users are aware of the validation results, like whether it is
Running the following argo-data-submission workflow without the duplicate check succeeds when it normally should fail:

```
nextflow run ../main.nf -params-file local-test-job-2-InputFastq.json -profile debug_qa,docker --api_token ${token}
N E X T F L O W  ~  version 22.04.0
Launching `../main.nf` [goofy_einstein] DSL2 - revision: 8a2de252d7
WARN: Nextflow version 22.04.0 does not match workflow required version: >=22.10.0 -- Execution will continue, but things may break!
executor >  local (10)
[26/9981f3] process > ArgoDataSubmissionWf:sanityCheck                                              [100%] 1 of 1 ✔
[a4/7ca8b6] process > ArgoDataSubmissionWf:checkCramReference                                       [100%] 1 of 1 ✔
[c6/f5956f] process > ArgoDataSubmissionWf:pGenExp                                                  [100%] 1 of 1 ✔
[f6/6bff5c] process > ArgoDataSubmissionWf:valSeq                                                   [100%] 1 of 1 ✔
[89/245ea3] process > ArgoDataSubmissionWf:uploadWf:songSub (TEST-QA)                               [100%] 1 of 1 ✔
[9c/145e3c] process > ArgoDataSubmissionWf:uploadWf:songMan (5c43282e-3fb7-4877-8328-2e3fb7d8777e)  [100%] 1 of 1 ✔
[d8/e3b7b5] process > ArgoDataSubmissionWf:uploadWf:scoreUp (5c43282e-3fb7-4877-8328-2e3fb7d8777e)  [100%] 1 of 1 ✔
[a0/0ee02c] process > ArgoDataSubmissionWf:uploadWf:songPub (5c43282e-3fb7-4877-8328-2e3fb7d8777e)  [100%] 1 of 1 ✔
[b0/59e24c] process > ArgoDataSubmissionWf:submissionReceipt                                        [100%] 1 of 1 ✔
[8d/52ecee] process > ArgoDataSubmissionWf:printOut                                                 [100%] 1 of 1 ✔
Payload JSON File     : /Users/esu/Desktop/GitHub/icgc-argo/argo-data-submission/argo-data-submission-wf/tests/work/c6/f5956f0d92d1e14daa5061be911a32/68211b84-d07b-4292-b52c-fe6311d79cb9.sequencing_experiment.payload.json
Analysis ID           : 5c43282e-3fb7-4877-8328-2e3fb7d8777e
Submission TSV Receipt: /Users/esu/Desktop/GitHub/icgc-argo/argo-data-submission/argo-data-submission-wf/tests/work/b0/59e24c8b054a8284f0925c1fac3731/5c43282e-3fb7-4877-8328-2e3fb7d8777e_submission_receipt.tsv
```
This is because the sanity check in argo-data-submission/sanity-check/main.py (lines 188 to 201 at ed5dcdb) evaluates only the first instance of an analysis. When a sample has multiple analyses, the first analysis in the list will determine the sanity pass/fail.
So for SA624380, which has a mixture of UNPUBLISHED and PUBLISHED analyses:
01d4e350-32e8-4731-94e3-5032e82731db UNPUBLISHED
026e7dbd-8a7b-4ee1-ae7d-bd8a7b0ee120 PUBLISHED
03d2714a-e860-48ad-9271-4ae860d8ad3d PUBLISHED
068c76dd-3806-4e0e-8c76-dd38064e0ea6 PUBLISHED
06e7cb7d-d5bd-4227-a7cb-7dd5bdc22740 UNPUBLISHED
Because the first instance, 01d4e350-32e8-4731-94e3-5032e82731db, is UNPUBLISHED, the sanity check misses flagging this sample.
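The fix amounts to inspecting every analysis of the sample rather than only the first list entry; a minimal sketch (function name hypothetical):

```python
def sample_fully_published(analyses):
    """Decide the sanity pass/fail from all analyses of a sample, not
    just the first one in the list."""
    return all(a["state"] == "PUBLISHED" for a in analyses)
```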
See icgc-argo/seq-tools#107; valSeq will need to be updated after the seq-tools update.
Changes to be made to the following configuration file for RDPC Prod/QA/Dev cutover to Cumulus