argo-data-submission's People

Contributors

edsu7, lindaxiang

argo-data-submission's Issues

On-the-fly md5sum and file size generation

To help with Xindi's file submission, we'll add new pipeline steps that calculate md5sum and file size on the fly for payload generation. These features will be hidden behind flags for developer use only, and a dependency will be added so that payload generation triggers only after ega-downloads.
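As a minimal sketch of the calculation itself (not the pipeline's actual module), the md5sum and size of a downloaded file can be computed in a single pass. The function name `md5sum_and_size` and the chunk size are illustrative assumptions.

```python
import hashlib


def md5sum_and_size(path, chunk_size=1024 * 1024):
    """Return (md5 hex digest, size in bytes) for the file at `path`.

    Hypothetical helper: reads the file in chunks so large FASTQ/BAM
    files do not have to fit in memory.
    """
    digest = hashlib.md5()
    size = 0
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return digest.hexdigest(), size
```

The two values could then be written into the `md5sum` and `size` columns used for payload generation.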

Bump `validate-seqtools`

Bump the validate-seq-tools version to use the newest seq-tools, because the latest sequencing-experiment schema change reorganized the schema and broke seq-tools.

๐Ÿ› `Decrypt-aspera` add renaming

Based on the following example:

type name format path ega_file_id md5sum size
sequencing_file DTB-097-Progression-cfDNA_1.fq.gz FASTQ EGAD00001008460/PART_2/prod_ega-box-1043_EGAR00003306459_DTB-097-Progression-cfDNA_1.fq.gz.1649803262196.c4gh EGAF00006144300 1fa88491833d65e1a4f7e8a08e458322 169568827826
sequencing_file DTB-097-Progression-cfDNA_2.fq.gz FASTQ EGAD00001008460/PART_2/prod_ega-box-1043_EGAR00003306459_DTB-097-Progression-cfDNA_2.fq.gz.1649809921842.c4gh EGAF00006144301 f42d0e69efcef37d8c7b883b6ddce3e1 191746184535

File prod_ega-box-1043_EGAR00003306459_DTB-097-Progression-cfDNA_1.fq.gz.1649803262196 is successfully downloaded and decrypted, but it then fails in decrypt-aspera because its suffix is not recognized.

Suggested fix: rename the file to work around such naming schemes.
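One way the rename could work, sketched under the assumption that the `name` column of the table above holds the intended file name: strip the `.c4gh` suffix from the basename of `path` to get the decrypted file name, then map it to `name`. The helper names here are hypothetical and not part of decrypt-aspera.

```python
from pathlib import PurePosixPath

ENCRYPTED_SUFFIX = ".c4gh"


def decrypted_basename(path):
    """Basename of `path` with the encryption suffix removed."""
    name = PurePosixPath(path).name
    if name.endswith(ENCRYPTED_SUFFIX):
        name = name[: -len(ENCRYPTED_SUFFIX)]
    return name


def rename_plan(rows):
    """Map each decrypted file name to the `name` column of its row."""
    return {decrypted_basename(row["path"]): row["name"] for row in rows}
```

Applying the resulting mapping (for example with `os.rename`) before the suffix check would let the file through regardless of the timestamp appended by EGA.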

Add non-committal run

Add an option so that a run goes through the whole pipeline but stops at the upload step.

missing column in `ega-download-wf` results in generic message

In ega-download-wf lines L93-L94, if the columns ega_file_id or path cannot be found, Nextflow outputs a generic error message:

Cannot invoke method size() on null object
 -- Check script 'GitHub/icgc-argo/argo-data-submission/argo-data-submission-wf/./wfpr_modules/github.com/icgc-argo/argo-data-submission/[email protected]/main.nf' at line: 108 or see '.nextflow.log' file for more details

User experience can be improved with a clearer message.
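The kind of check that would produce a clearer message can be sketched as follows. This is written in Python for illustration only; the workflow itself is Nextflow, and the function name and message wording are assumptions.

```python
import csv
import io

REQUIRED_COLUMNS = ("ega_file_id", "path")


def read_file_info(tsv_text, required=REQUIRED_COLUMNS):
    """Parse a file-info TSV, failing early if a required column is absent."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    header = reader.fieldnames or []
    missing = [column for column in required if column not in header]
    if missing:
        # Name the missing columns instead of failing later on a null lookup
        raise ValueError(
            "file info TSV is missing required column(s): " + ", ".join(missing)
        )
    return list(reader)
```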

Handle `validation.report.UNKNOWN.json` in seq-tools

Currently the seq-tools step is reported as failed if validation.report.INVALID.json is detected. Otherwise it looks for validation.report.valid.json.

When the script errors out, it can generate validation.report.UNKNOWN.json, which is reported as a PASS in Nextflow.
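A stricter rule would pass the step only when every report found is explicitly valid. The sketch below assumes the `validation.report.<STATUS>.json` naming used above and compares the status case-insensitively; the function name is hypothetical.

```python
import re

REPORT_PATTERN = re.compile(r"^validation\.report\.([A-Za-z_-]+)\.json$")


def step_passed(filenames):
    """Return True only if at least one report exists and all are valid."""
    statuses = set()
    for name in filenames:
        match = REPORT_PATTERN.match(name)
        if match:
            statuses.add(match.group(1).upper())
    # No report, an INVALID report, or an UNKNOWN report all count as failure
    return statuses == {"VALID"}
```

With this rule an UNKNOWN report fails the step instead of slipping through as a pass.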

Update seq-tools to support nested CRAM info check

Given the example :

  "files": [
    {
      "fileName": "anon_chr1_completeA.bam",
      "fileSize": 269442,
      "fileMd5sum": "968711f781f217dfb1de630e520ccacb",
      "fileType": "BAM",
      "fileAccess": "controlled",
      "dataType": "Submitted Reads",
      "info": {
        "data_category": "Sequencing Reads",
        "original_cram_info": {
          "fileName": "anon_chr1_completeA.cram",
          "fileSize": 132507,
          "fileMd5sum": "0d1776d44d8da87758d6b159aa8e6bc5",
          "fileType": "CRAM",
          "referenceFileName": "hs73d5.fa.gz"
        }
      }
    }

Seq-tools currently only checks files[0][fileSize] and files[0][fileMd5sum].
For CRAM files, the original info is nested under info.original_cram_info, and the values of the converted version are auto-populated into files[0][fileSize] and files[0][fileMd5sum].
This potentially allows a submitter to submit incorrect CRAM info and bypass the md5sum and fileSize checks.

  • Update Seq-tools
  • Release updated seq-tools version
  • Update Seq-tools WFPM module
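A rough sketch of what a nested check might collect, assuming the payload layout shown in the example above. The function name is hypothetical and this is not the seq-tools implementation.

```python
def entries_to_verify(file_entry):
    """Return (fileName, fileSize, fileMd5sum) tuples that need checking.

    Includes the top-level entry and, when present, the nested
    info.original_cram_info entry.
    """
    entries = [
        (file_entry["fileName"], file_entry["fileSize"], file_entry["fileMd5sum"])
    ]
    original = file_entry.get("info", {}).get("original_cram_info")
    if original:
        entries.append(
            (original["fileName"], original["fileSize"], original["fileMd5sum"])
        )
    return entries
```

Each returned tuple could then be compared against the md5sum and size of the corresponding file on disk, so the submitted CRAM values are verified as well as the auto-populated BAM values.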

Schema validation fails to catch an invalid value

Although the field experiment/platform in submission-song.rdpc-qa sequencing_experiment schema has an enum definition

"platform": {
 "enum": [
  "CAPILLARY",
  "LS454",
  "ILLUMINA",
  "SOLID",
  "HELICOS",
  "IONTORRENT",
  "ONT",
  "PACBIO",
  "Nanopore",
  "BGI"
]

but all the submission payloads passed schema validation when the field was provided as:

"plzatform": "Illumina",

In addition, neither the submission workflow nor the SONG server validation caught the error.
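The issue does not state why validation passed. As an illustration only, the sketch below shows a strict check that would flag both a misspelled key and a value outside the enum. The set of known fields is a placeholder assumption (the real schema has more fields), and the function is not part of the workflow or SONG.

```python
PLATFORM_ENUM = {
    "CAPILLARY", "LS454", "ILLUMINA", "SOLID", "HELICOS",
    "IONTORRENT", "ONT", "PACBIO", "Nanopore", "BGI",
}

# Placeholder: only the field discussed in this issue
KNOWN_FIELDS = {"platform"}
ENUMS = {"platform": PLATFORM_ENUM}


def check_experiment(experiment, known_fields=KNOWN_FIELDS, enums=ENUMS):
    """Return a list of error strings; an empty list means no problems found."""
    errors = []
    for key, value in experiment.items():
        if key not in known_fields:
            errors.append(f"unknown field '{key}'")
        elif key in enums and value not in enums[key]:
            # Enum matching is case-sensitive, as in JSON Schema
            errors.append(f"invalid value '{value}' for field '{key}'")
    return errors
```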

Print the validation report from `seq-tools` at the end

Consider printing the information from the validation report generated by seq-tools in the final step, so that users are aware of the validation result, i.e. whether it is:

  • PASS
  • PASS-with-WARNING
  • PASS-with-SKIPPED-check
  • PASS-with-WARNING-and-SKIPPED-check

They can then decide whether further action is needed.
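A minimal sketch of how the four labels could be derived, assuming the final step can tell whether the report contains any warnings or skipped checks. How those two flags are read from the seq-tools report is not covered here, and the function name is hypothetical.

```python
def overall_status(has_warning, has_skipped):
    """Map the two flags to one of the four PASS labels listed above."""
    if has_warning and has_skipped:
        return "PASS-with-WARNING-and-SKIPPED-check"
    if has_warning:
        return "PASS-with-WARNING"
    if has_skipped:
        return "PASS-with-SKIPPED-check"
    return "PASS"
```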

๐Ÿ› Sanity check failure

Describe the bug

Running the following argo-data-submission workflow succeeds without the duplicate check being triggered, when it should normally fail:

nextflow run ../main.nf -params-file local-test-job-2-InputFastq.json -profile debug_qa,docker --api_token ${token}
N E X T F L O W  ~  version 22.04.0
Launching `../main.nf` [goofy_einstein] DSL2 - revision: 8a2de252d7
WARN: Nextflow version 22.04.0 does not match workflow required version: >=22.10.0 -- Execution will continue, but things may break!
executor >  local (10)
[26/9981f3] process > ArgoDataSubmissionWf:sanityCheck                                             [100%] 1 of 1 ✔
[a4/7ca8b6] process > ArgoDataSubmissionWf:checkCramReference                                      [100%] 1 of 1 ✔
[c6/f5956f] process > ArgoDataSubmissionWf:pGenExp                                                 [100%] 1 of 1 ✔
[f6/6bff5c] process > ArgoDataSubmissionWf:valSeq                                                  [100%] 1 of 1 ✔
[89/245ea3] process > ArgoDataSubmissionWf:uploadWf:songSub (TEST-QA)                              [100%] 1 of 1 ✔
[9c/145e3c] process > ArgoDataSubmissionWf:uploadWf:songMan (5c43282e-3fb7-4877-8328-2e3fb7d8777e) [100%] 1 of 1 ✔
[d8/e3b7b5] process > ArgoDataSubmissionWf:uploadWf:scoreUp (5c43282e-3fb7-4877-8328-2e3fb7d8777e) [100%] 1 of 1 ✔
[a0/0ee02c] process > ArgoDataSubmissionWf:uploadWf:songPub (5c43282e-3fb7-4877-8328-2e3fb7d8777e) [100%] 1 of 1 ✔
[b0/59e24c] process > ArgoDataSubmissionWf:submissionReceipt                                       [100%] 1 of 1 ✔
[8d/52ecee] process > ArgoDataSubmissionWf:printOut                                                [100%] 1 of 1 ✔

Payload JSON File : /Users/esu/Desktop/GitHub/icgc-argo/argo-data-submission/argo-data-submission-wf/tests/work/c6/f5956f0d92d1e14daa5061be911a32/68211b84-d07b-4292-b52c-fe6311d79cb9.sequencing_experiment.payload.json
Analysis ID : 5c43282e-3fb7-4877-8328-2e3fb7d8777e
Submission TSV Receipt: /Users/esu/Desktop/GitHub/icgc-argo/argo-data-submission/argo-data-submission-wf/tests/work/b0/59e24c8b054a8284f0925c1fac3731/5c43282e-3fb7-4877-8328-2e3fb7d8777e_submission_receipt.tsv

This is because, in the sanity check:

for analysis in response.json():
    ### If analysis is suppressed; we ignore
    if \
        analysis["analysisState"]=="PUBLISHED" and \
        analysis["experiment"]["experimental_strategy"]==metadata.get('experimental_strategy') and \
        analysis['analysisType']['name']=="sequencing_experiment"\
    :
        sys.exit(
            "Sample '%s'/'%s' has an existing published analysis '%s' for experiment_strategy '%s.'"
            % \
            (metadata.get('submitter_sample_id'),metadata.get('sample_id'),analysis['analysisId'],metadata.get('experimental_strategy'))
        )
    else:
        return True

The loop returns on the first analysis it evaluates. When a sample has multiple analyses, the first analysis in the list alone determines whether the sanity check passes or fails.

So for SA624380 that has a mixture of UNPUBLISHED and PUBLISHED analyses:

01d4e350-32e8-4731-94e3-5032e82731db UNPUBLISHED
026e7dbd-8a7b-4ee1-ae7d-bd8a7b0ee120 PUBLISHED
03d2714a-e860-48ad-9271-4ae860d8ad3d PUBLISHED
068c76dd-3806-4e0e-8c76-dd38064e0ea6 PUBLISHED
06e7cb7d-d5bd-4227-a7cb-7dd5bdc22740 UNPUBLISHED

Because the first instance, 01d4e350-32e8-4731-94e3-5032e82731db, is UNPUBLISHED, the sanity check fails to flag this sample.
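A possible fix, sketched here as standalone functions rather than the module's actual code, is to examine every analysis and only report success after the loop has finished. The function names are hypothetical, and the field names follow the excerpt above.

```python
def find_published_conflicts(analyses, experimental_strategy):
    """Return IDs of all PUBLISHED sequencing_experiment analyses
    that share the given experimental strategy."""
    return [
        analysis["analysisId"]
        for analysis in analyses
        if analysis.get("analysisState") == "PUBLISHED"
        and analysis.get("experiment", {}).get("experimental_strategy")
        == experimental_strategy
        and analysis.get("analysisType", {}).get("name") == "sequencing_experiment"
    ]


def sanity_check(analyses, experimental_strategy):
    """Exit with an error if any conflicting analysis exists, else return True."""
    conflicts = find_published_conflicts(analyses, experimental_strategy)
    if conflicts:
        raise SystemExit(
            "Existing published analyses for experimental_strategy '%s': %s"
            % (experimental_strategy, ", ".join(conflicts))
        )
    # Reached only after every analysis has been inspected
    return True
```

With the SA624380 list above, the UNPUBLISHED first entry no longer short-circuits the check, and the three PUBLISHED analyses are reported.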
