analysis-wdls's Issues

Optimize Optitype resource usage

Currently for the optitype step we request 64GB of RAM and say nothing about CPUs.
https://github.com/griffithlab/analysis-wdls/blob/main/definitions/tools/optitype_dna.wdl

But when Cromwell requests 64GB of RAM from Google we get this reported back:

"machineType": "custom-10-65536",

Jul 06 12:02:22 malachi-immuno-test java[14290]: 2022-07-06 12:02:22,783 cromwell-system-akka.dispatchers.backend-dispatcher-8451 INFO  - PipelinesApiAsyncBackendJobExecutionActor [UUID(78967613)immuno.optitype:NA:2]: To comply with GCE custom machine requirements, cpu was adjusted from 1 to 10

This is an expensive machine that we then massively underutilize in the optitype script here:
https://github.com/genome/docker-immuno_tools-cwl/blob/master/optitype_script.sh

The bottom line is that we wind up with a machine with 64GB of RAM and 10 CPUs. It's not clear whether we need that much RAM, but we never use more than 4 CPUs: thread usage for the sambamba and bwa mem steps in the script is hard-coded to 4.
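
Since custom N1 machines allow up to 6.5GB of memory per vCPU (see the "Optimize CPU/RAM requests" issue below), anything at or below 26GB paired with 4 CPUs avoids the forced bump to 10 cores. A minimal sketch of what the runtime block in optitype_dna.wdl might become, assuming 4 threads really is the ceiling:

    runtime {
      # docker and disk settings unchanged from the current task
      cpu: 4            # matches the hard-coded thread usage in optitype_script.sh
      memory: "26GB"    # placeholder; the actual requirement needs benchmarking
    }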

Fix StarFusion reference bundle

In discussions with Brian Haas, I have learned that our star-fusion reference bundle has an incorrect file:

Our bundle is here:
https://storage.cloud.google.com/griffith-lab-workflow-inputs/human_GRCh38_ens105/aligner_indices/star-fusion_1.10.1_index.zip

It has a generic version of AnnotFilterRule.pm, which needs to be updated to use cancer-specific whitelist/blacklist information.

The updated file can be found here and swapped in:
https://data.broadinstitute.org/Trinity/CTAT_RESOURCE_LIB/AnnotFilterRule.pm

This will require pulling the annotation bundle, unpacking it, replacing the file, zipping again and re-uploading it.

Consolidate tasks+workflows definitions where reasonable

As of this writing, almost all WDLs are just ports from CWL as close to 1:1 as possible. WDL lets us combine tasks into single files.

We'd like to consolidate the definitions: group related pieces, group pieces that all rely on the same tool (e.g. samtools or picard), and combine common tasks for optimization wherever reasonable. Identify places where this can be done and do them as needed. This is probably an ongoing effort rather than a single task.

Add class II HLA typing to immuno analysis

Currently we only have class I integrated. Even if we have clinical HLA typing results, it seems important to calculate class II types from the normal/tumor exome data so that we can compare against the expected results from clinical typing. We should also support the use case where clinical typing for class II is not available.

This could be implemented in a conceptually similar way to what we do for class I.

  • Calculate class II types
  • Compare to clinical types if available and calculate consensus
  • Parse class II results and set up inputs for use in pvacseq / pvacfuse

Rename end bias plot

We currently produce a final output file: "qc/tumor_rna/rna_metrics.pdf"

This should be renamed to "qc/tumor_rna/rna_coverage_by_transcript_position.pdf"

In call-ps (pvacseq) step, make sure index is newer than VCF to avoid many warnings

Seeing a ton of warnings like this from the pvacseq step:

[W::hts_idx_load3] The index file is older than the data file: /cromwell_root/griffith-lab-test-immuno-pipeline/cromwell-executions/immuno/dedac0c9-a4e1-4178-b2d1-ee37995df2c4/call-phaseVcf/phaseVcf/ffb6dc78-a9ee-458e-802a-48f0edc78ecd/call-bgzipAndIndexPhasedVcf/bgzipAndIndex/897cf922-ffe9-4409-9e5e-5611dc3e6d9a/call-index/phased.vcf.gz.tbi

Presumably when the phased.vcf.gz.tbi is localized to /cromwell_root/ on the worker instance, it lands first and appears older than the actual VCF file.

Injecting a touch on this file after localization, so that it appears newer than the VCF when used, should resolve this?
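
A minimal sketch of that idea inside the pvacseq task's command block, assuming the phased VCF index arrives as an input named phased_vcf_tbi (the name here is illustrative):

    command <<<
      set -o errexit
      # refresh the index timestamp so htslib no longer considers it older than the VCF
      touch ~{phased_vcf_tbi}
      # ... existing pvacseq invocation continues below, unchanged
    >>>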

Use localization_optional when possible

There is an option that allows large files (like bams) to be streamed from a gs:// address when the tool supports doing so:
https://cromwell.readthedocs.io/en/stable/optimizations/FileLocalization/

We would like to implement this wherever possible to save on disk usage and time spent localizing large BAM files. (Mutect and GATK HaplotypeCaller are two obvious places to start.)

Other options could include a "hack" to stream the bam through samtools first. Varscan in particular seems amenable to this, as it is parallelized in chunks, so we could samtools view region >tmp.bam && mpileup tmp.bam | java -jar varscan.jar ... (or maybe even mpileup can stream directly?)
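
For the straightforward cases (e.g. GATK HaplotypeCaller, which can read gs:// paths directly), the change is mostly a parameter_meta entry; a sketch with hypothetical input names:

    parameter_meta {
      bam: { localization_optional: true }
      bam_bai: { localization_optional: true }
    }

    command <<<
      set -o errexit
      # with localization_optional set, ~{bam} expands to the gs:// URI and GATK
      # streams reads from the bucket instead of reading a localized copy
      /gatk/gatk HaplotypeCaller -R ~{reference} -I ~{bam} -O haplotype_caller.vcf.gz
    >>>

With that in place the task's disk request can also shrink, since the BAM never lands on the worker's disk.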

pvacseq/pvacfuse hangs and fails to complete

The tasks pvacseq and pvacfuse hang, eventually time out, and go to a retry. This happened with a number of different samples. I was able to fix the hanging by changing back to the docker image used before the last PR.

Changing griffithlab/pvactools:4.0.3 to susannakiwala/pvactools:4.0.1rc1 to run the task in both analysis-wdls/definitions/tools/pvacseq.wdl and analysis-wdls/definitions/tools/pvacfuse.wdl fixed the problem and the run went to completion.

Fix germline variant calling inputs for the pvactool proximal variation analysis

It seems that the current germline analysis is filtering out common SNPs via a GNOMAD filter here:

call gfv.germlineFilterVcf as filterVcf {
input:
annotated_vcf=annotateVariants.annotated_vcf,
reference=reference,
reference_fai=reference_fai,
reference_dict=reference_dict,
gnomad_field_name=gnomad_field_name,
filter_gnomAD_maximum_population_allele_frequency=filter_gnomAD_maximum_population_allele_frequency,
limit_variant_intervals=limit_variant_intervals
}

This perhaps makes sense if the goal is to identify putative pathogenic/de novo variants, or if the goal is to use this pipeline for germline-only variant calling.

BUT, it does not make sense if we want to get germline SNPs for the purposes of phasing with somatic variants and correcting peptide/neoantigen sequences to account for these proximal variants. In that context we want all germline variants that are real. Having a full germline variant VCF would also be useful in other contexts, for example getting SNPs for LOH analysis.

We should update the pipeline to:

  • make it clearer from the names of steps/output files what is being done here
  • create an output that processes the germline VCF, largely the same way as currently, but does NOT apply the GNOMAD filter
  • update immuno to use the full germline VCF (without GNOMAD filtering) for the pvacseq proximal variants input
  • update immuno to save both versions of the germline VCF: one with GNOMAD variants filtered out and one without this filter applied.

Updates to pvacseq

We will want to make updates to support the following additions to pvacseq 3.1 and 4:

  • Support for BLAST- and peptide-based scanning of the reference proteome (the peptide version is much faster): --run-reference-proteome-similarity, ....
  • Support for problematic amino acids: --problematic-amino-acids.
  • Support for --aggregate-inclusion-binding-threshold
  • Support for --allele-specific-anchors

This will require some changes to the pvacseq.wdl tool and to immuno to pass in the options.
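
A rough sketch of how these might be exposed in pvacseq.wdl (input names here are hypothetical and the defaults would need discussion):

    input {
      Boolean run_reference_proteome_similarity = false
      String? problematic_amino_acids
      Float? aggregate_inclusion_binding_threshold
      Boolean allele_specific_anchors = false
    }

with the corresponding flags appended to the existing pvacseq run command line via the usual optional-placeholder idiom:

    ~{if run_reference_proteome_similarity then "--run-reference-proteome-similarity" else ""} \
    ~{"--problematic-amino-acids " + problematic_amino_acids} \
    ~{"--aggregate-inclusion-binding-threshold " + aggregate_inclusion_binding_threshold} \
    ~{if allele_specific_anchors then "--allele-specific-anchors" else ""}

Matching inputs would then be added to immuno.wdl and passed down to the pvacseq (and, where applicable, pvacfuse) calls.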

Rename final result file: "optitype_calls.txt" to something more intuitive

Now that we have class I and II typing from optitype and phlat I believe this name no longer reflects the contents of this file:
hla_typing/optitype_calls.txt

A recent example of this file had these contents:
HLA-A*02:01,HLA-A*01:01,HLA-B*15:17,HLA-B*13:02,HLA-C*06:02,HLA-C*07:01,DQA1*01:02,DQA1*02:01,DQB1*02:02,DQB1*06:09,DRB1*07:01,DRB1*13:02

This is confusing, because Optitype only makes class I calls. So how can we have all those class II alleles here? Presumably these are now coming from Phlat as well?

I believe the file is being created here:
https://github.com/griffithlab/analysis-wdls/blob/50122a2c90285d14a8cf06dddc92134714da2e9c/definitions/tools/hla_consensus.wdl#L185

It seems like maybe that file now contains the consensus of:
hla_typing/phlat_normal_HLA.sum and hla_typing/optitype_normal_result.tsv

The code above sounds like it will just write out the optitype calls...

But actually it now looks like when we extract HLA alleles, we do so for both Phlat and Optitype, and then pass all of this into the hlaConsensus task:
https://github.com/griffithlab/analysis-wdls/blob/018f5c0721252a57c22435a68123f611612fb7f1/definitions/immuno.wdl#L378-L390

The inputs to those steps could perhaps also be renamed to be a bit clearer.

Experiment with use of private VMs

Currently all VMs seem to be using ephemeral external IPs. Google charges for these similarly to what they would charge for a reserved IP, even while it's in use.

It also theoretically makes these instances less secure. It's not clear that we actually need to be able to access worker instances externally; perhaps only the Cromwell server needs to be reachable from outside. If one did need to get onto an individual worker instance, that could still be done by using the Cromwell server itself as a jump host.

Here is some discussion of how to do this:
https://github.com/atgu/cromwell_google_setup#using-vms-with-private-ip-addresses

Essentially to do this you can add noAddress: true to the runtime block of a task. We could test this and see if it breaks anything.
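
A minimal sketch of the change to a task (all other runtime attributes stay as they are):

    runtime {
      # existing docker, cpu, memory and disks settings unchanged
      noAddress: true
    }

One caveat to check first: workers without external IPs would presumably need Private Google Access or a Cloud NAT gateway on the subnet in order to pull Docker images and reach Google APIs.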

Improve selection of variant annotation for the variants.tsv

Currently we sometimes get a sub-optimal variant annotation being selected in the TSV. While this doesn't impact the VCF or the pVACtools analysis ... it is handy to have a table of variants where the "top" variant annotation is selected (one row per variant).

Investigate inconsistent variant results from varscan when using localization_optional

We have observed that shards of varscan work are sporadically failing and not being caught by Cromwell.

This manifests as statements like this in the Varscan stderr:

[E::bgzf_read_block] Failed to read BGZF block data at offset 2296748725 expected 9949 bytes; hread returned -1
[E::bgzf_read] Read block operation failed with error 4 after 65 of 234 bytes
samtools mpileup: error reading from input file

To get a clean run of VarScan results for comparison we can turn localization optional off by removing this:

parameter_meta {
tumor_bam: { localization_optional: true }
tumor_bam_bai: { localization_optional: true }
normal_bam: { localization_optional: true }
normal_bam_bai: { localization_optional: true }
}

Then, to hopefully cause Cromwell to detect the failures and retry, we can add set -o pipefail to the existing command block:

command <<<
set -o errexit
set -o nounset
set -o pipefail

Automate process for tagged releases

As of this writing, there is no versioning handled. Creating a release process, preferably automated, gives us some consistency benefits. Follow-up work would be to allow the cloud-workflows approaches to specify which release of this repo to use.

Add use of monitoring script to cromwell runs

Cromwell's Google backend allows for use of a "monitoring_script":

https://cromwell.readthedocs.io/en/stable/backends/Google/

{
  "monitoring_script": "gs://cromwell/monitoring/script.sh"
}

"The output of this script will be written to a monitoring.log file that will be available in the call gcs bucket when the call completes. "

This sounds very helpful. But what would such a monitoring script look like? I have not been able to find any example of someone using this functionality.

Add tabix indexed file to output

We currently keep this final variants file as an output:
final_results/annotated.expression.vcf.gz.

We should also keep the tabix index for it.
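
In WDL terms this is just one more entry in the workflow's output block next to the existing VCF; a sketch with hypothetical call/output names:

    output {
      File annotated_expression_vcf = annotatedExpressionVcf.indexed_vcf
      File annotated_expression_vcf_tbi = annotatedExpressionVcf.indexed_vcf_tbi
    }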

Update hla_consensus.wdl to handle ambiguity codes.

Sometimes the clinical reports contain HLA codes like "HLA-B*08:YETY" that don't match our expected four-digit resolution like HLA-\w*\d{2}:\d{2}.

This site contains a list of ambiguity codes and the alleles they match up with: https://hml.nmdp.org/MacUI/

For example, HLA-B*08:YETY translates to three options: HLA-B*08:01/HLA-B*08:19N/HLA-B*08:109

The goal is to update the hla_consensus.wdl code to handle these intelligently by:

  1. disambiguating any codes from the clinical HLA typing inputs
  2. comparing the possibilities to those reported by Optitype or Phlat and choosing the alleles that match
  3. if there is no disambiguation, or there is no match, drop the ambiguously coded allele and handle the rest in the same way as is currently done (unique union?)

Allow user to set a gnomad allele freq threshold for filtering

maximum_population_allele_frequency is used to filter variants with a GNOMAD allele frequency greater than this value. The default is set in detect_variants to 0.001. If a variant's gnomad_field_name value exceeds this threshold, it gets filtered out.

Float filter_gnomADe_maximum_population_allele_frequency = 0.001

Float filter_gnomADe_maximum_population_allele_frequency

Float maximum_population_allele_frequency

BUT, this parameter is not exposed up to the pipelines. We would like to be able to set this in immuno.wdl.

One use case would be to essentially turn off GNOMAD filtering by setting: immuno.maximum_population_allele_frequency: 1.
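
A sketch of the plumbing: declare the input in immuno.wdl (reusing the existing name for clarity),

    Float filter_gnomADe_maximum_population_allele_frequency = 0.001

and pass it through each intermediate call down to where the filter runs, e.g.:

    filter_gnomADe_maximum_population_allele_frequency=filter_gnomADe_maximum_population_allele_frequency,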

Finish/Verify conversions from CWL

The following text is taken from the old README at the root of this repository. I am unsure how many of these have yet to be converted, but it's worth looking through, verifying missing pieces, and making a decision to convert or omit.


Conversions

Pipelines

  • alignment_exome
  • alignment_exome_nonhuman
  • alignment_umi_duplex
    # this depends on a thing with non-trivial embedded javascript
  • alignment_umi_molecular
    # this depends on a thing with non-trivial embedded javascript
  • alignment_wgs
  • alignment_wgs_nonhuman
  • aml_trio_cle
  • aml_trio_cle_gathered # This doesn't make sense in cloud
  • bisulfite
  • chipseq
    # This depends on homer-tag-directory, doesn't make sense in cloud
  • chipseq_alignment_nonhuman
    # This depends on homer-tag-directory, doesn't make sense in cloud
  • detect_variants
  • detect_variants_nonhuman
  • detect_variants_wgs
  • downsample_and_recall
  • gathered_downsample_and_recall # This doesn't make sense in cloud
  • germline_exome
  • germline_exome_gvcf
  • germline_exome_hla_typing
  • germline_wgs
  • germline_wgs_gvcf
  • immuno
  • rnaseq
  • rnaseq_star_fusion
  • rnaseq_star_fusion_with_xenosplit
  • somatic_exome
  • somatic_exome_cle
  • somatic_exome_cle_gathered # This doesn't make sense in cloud
  • somatic_exome_gathered # This doesn't make sense in cloud
  • somatic_exome_nonhuman
  • somatic_wgs
  • tumor_only_detect_variants
  • tumor_only_exome
  • tumor_only_wgs

Subworkflows

  • align
  • align_sort_markdup
  • bam_readcount
  • bam_to_trimmed_fastq_and_hisat_alignments
  • bgzip_and_index
  • bisulfite_qc
  • cellranger_mkfastq_and_count
  • cnvkit_single_sample
  • cram_to_bam_and_index
  • cram_to_cnvkit
  • docm_cle
  • docm_germline
  • duplex_alignment
  • filter_vcf
  • filter_vcf_nonhuman
  • fp_filter
  • gatk_haplotypecaller_iterator
  • germline_detect_variants
  • germline_filter_vcf
  • hs_metrics
  • joint_genotype
  • merge_svs
  • molecular_alignment
  • molecular_qc
  • mutect
  • phase_vcf
  • pindel
  • pindel_cat
  • pindel_region
  • pvacseq
  • qc_exome
  • qc_exome_no_verify_bam
  • qc_wgs
  • qc_wgs_nonhuman
  • sequence_align_and_tag_adapter
  • sequence_to_bqsr
  • sequence_to_bqsr_nonhuman
  • sequence_to_trimmed_fastq
  • sequence_to_trimmed_fastq_and_biscuit_alignments
  • single_cell_rnaseq
  • single_sample_sv_callers
  • strelka_and_post_processing
  • strelka_process_vcf
  • sv_depth_caller_filter
  • sv_paired_read_caller_filter
  • umi_alignment
  • varscan
  • varscan_germline
  • varscan_pre_and_post_processing
  • vcf_eval_cle_gold
  • vcf_eval_concordance
  • vcf_readcount_annotator

Tools

  • add_strelka_gt
  • add_string_at_line
  • add_string_at_line_bgzipped
  • add_vep_fields_to_table
  • agfusion
  • align_and_tag
  • annotsv
  • annotsv_filter
  • apply_bqsr
  • bam_readcount
  • bam_to_bigwig
  • bam_to_cram
  • bam_to_fastq
  • bam_to_sam
  • bcftools_merge
  • bedgraph_to_bigwig
  • bedtools_intersect
  • bgzip
  • biscuit_align
  • biscuit_markdup
  • biscuit_pileup
  • bisulfite_qc_conversion
  • bisulfite_qc_coverage_stats
  • bisulfite_qc_cpg_retention_distribution
  • bisulfite_qc_mapping_summary
  • bisulfite_vcf2bed
  • bqsr
  • call_duplex_consensus
  • call_molecular_consensus
  • cat_all
  • cat_out
  • cellmatch_lineage
  • cellranger_atac_count
  • cellranger_count
  • cellranger_feature_barcoding
  • cellranger_mkfastq
  • cellranger_vdj
  • cle_aml_trio_report_alignment_stat
  • cle_aml_trio_report_coverage_stat
  • cle_aml_trio_report_full_variants
  • clip_overlap
  • cnvkit_batch
  • cnvkit_vcf_export
  • cnvnator
  • collect_alignment_summary_metrics
  • collect_gc_bias_metrics
  • collect_hs_metrics
  • collect_insert_size_metrics
  • collect_wgs_metrics
  • combine_gvcfs
  • combine_variants
  • combine_variants_concordance
  • combine_variants_wgs
  • concordance
  • cram_to_bam
  • docm_add_variants
  • docm_gatk_haplotype_caller
  • downsample
  • duphold
  • duplex_seq_metrics
  • eval_cle_gold
  • eval_vaf_report
  • extract_hla_alleles
  • extract_umis
  • fastq_to_bam
  • filter_consensus
  • filter_known_variants
  • filter_sv_vcf_blocklist_bedpe
  • filter_sv_vcf_depth
  • filter_sv_vcf_read_support
  • filter_sv_vcf_size
  • filter_vcf_cle
  • filter_vcf_coding_variant
  • filter_vcf_custom_allele_freq
  • filter_vcf_depth
  • filter_vcf_docm
  • filter_vcf_mapq0
  • filter_vcf_somatic_llr
  • fix_vcf_header
  • fp_filter
  • gather_to_sub_directory
  • gatherer
  • gatk_genotypegvcfs
  • gatk_haplotype_caller
  • generate_qc_metrics
  • germline_combine_variants
  • grolar
  • group_reads
  • hisat2_align
  • hla_consensus
  • homer_tag_directory
    # This doesn't make sense in cloud
  • index_bam
  • index_cram
  • index_vcf
  • intersect_known_variants
  • interval_list_expand
  • intervals_to_bed
  • kallisto
  • kmer_size_from_index
  • manta_somatic
  • mark_duplicates_and_sort
  • mark_illumina_adapters
  • merge_bams
  • merge_bams_samtools
  • merge_vcf
  • mutect
  • name_sort
  • normalize_variants
  • optitype_dna
  • picard_merge_vcfs
  • pindel
  • pindel2vcf
  • pindel_somatic_filter
  • pizzly
  • pvacbind
  • pvacfuse
  • pvacseq
  • pvacseq_combine_variants
  • pvacvector
  • read_backed_phasing
  • realign
  • remove_end_tags
  • rename
  • replace_vcf_sample_name
  • samtools_flagstat
  • samtools_sort
  • select_variants
  • sequence_align_and_tag
  • sequence_to_bam
    # this uses non-trivial embedded javascript
  • sequence_to_fastq
  • set_filter_status
  • single_sample_docm_filter
  • smoove
  • somatic_concordance_graph
  • sompy
  • sort_vcf
  • split_interval_list
  • split_interval_list_to_bed
  • staged_rename
  • star_align_fusion
  • star_fusion_detect
  • strandedness_check
  • strelka
  • stringtie
  • survivor
  • transcript_to_gene
  • trim_fastq
  • umi_align
  • variants_to_table
  • varscan_germline
  • varscan_process_somatic
  • varscan_somatic
  • vcf_expression_annotator
  • vcf_readcount_annotator
  • vcf_sanitize
  • vep
  • verify_bam_id
  • vt_decompose
  • xenosplit

Somatic WGS does not include gnomAD filtering

We use it in exome, in the filter_vcf step, but the gnomAD VCF we're using is only a subset, restricted to exome coordinates. It would be nice to revisit that decision at some point (options: leave it out, apply it to exome only, or find a new genome-wide VCF).

Organize outputs into structs where appropriate

Pipelines which are considered "terminal" should get their outputs organized into structs (which coupled with our scripts for pulling results, makes for neatly nested directories). Right now immuno.cwl is organized in this way.

  1. define the list of these (certainly somatic exome, rnaseq, immuno, etc)

  2. figure out how to pass results from WDLs that can be run either as subworkflows or stand-alone (e.g. rnaseq, which feeds into immuno). Can the output structs just be returned and their elements accessed from the higher-level WDLs? Or do we need a param/conditional to say "do this struct stuff only if you're not being run as a subworkflow"?

Optimize mark_duplicates_and_sort

The tools/mark_duplicates_and_sort.wdl is a bottleneck, especially for WGS. It's expensive, and that's partially because it is long-running and gets preempted. Do some local testing on the cluster to explore options for optimizing it:

  • right now the sort and markdup get 8 cores each. Is that the optimal ratio? If one is faster than the other, there'll be wasted cycles.
  • If we increase the number of overall cores, how does that affect runtime? (do we saturate I/O? is that different between HDD/SSD?)
  • Can we prevent localization of the input files to save an hour or so?
  • would giving more RAM to the sort part of that step allow it to do fewer slow writes of temp files to disk and speed things up?

Explore streaming sequence to Fastq and/or trimFastq steps

These two steps eat up lots of disk costs:
sequenceToFastq: 0.139237127 local-disk 552 SSD (two of those)
trimFastq: 0.792021773 local-disk 1148 SSD (two of those)

  1. sequenceToFastq runs even when FASTQs are given, and makes a copy - this is wasteful. Can we set conditional execution of that step? (WDL does support this; see the sketch after this list.)

  2. trimfastq could probably be piped directly into bwa, sacrificing some composability for speed. There is already optional adapter trimming in sequence_align_and_tag.wdl. It seems like we could add other trimming there as well (or in the HISAT or STAR steps for RNA).
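
For item 1, WDL does support conditional execution with if blocks, so the conversion could be skipped whenever FASTQs are supplied directly. A rough sketch with invented input and call names:

    if (!defined(fastq1)) {
      call stf.sequenceToFastq as sequenceToFastq {
        input: bam=sequence_bam
      }
    }
    File final_fastq1 = select_first([fastq1, sequenceToFastq.fastq1])

Outputs of a conditional call become optional outside the block, which is why select_first is used to pick whichever source is defined.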

Watch out for problematic values in manually supplied metrics for FDA report

I encountered this error:

Mar 10 10:45:13 malachi-immuno-mcdb024 java[14444]: 2023-03-10 10:45:13,121 cromwell-system-akka.dispatchers.engine-dispatcher-3541 INFO  - WorkflowManagerActor: Workflow 1c5196b6-0756-4d54-943d-59755d2198f0 failed (during ExecutingWorkflowState): java.lang.Exception: Task generateFdaMetricsForBamOrFastqs.generateFdaTables:NA:3 failed. The job was stopped before the command finished. PAPI error code 9. Execution failed: generic::failed_precondition: while running "/cromwell_root/script": unexpected exit status 2 was not ignored
Mar 10 10:45:13 malachi-immuno-mcdb024 java[14444]: [UserAction] Unexpected exit status 2 while running "/cromwell_root/script": /cromwell_root/script: line 42: syntax error near unexpected token `('
Mar 10 10:45:13 malachi-immuno-mcdb024 java[14444]: /cromwell_root/script: line 42: `      --spike_in_error_rate <=0.50% (R1), <=0.75% (R2) \'
Mar 10 10:45:13 malachi-immuno-mcdb024 java[14444]:         at cromwell.backend.google.pipelines.common.PipelinesApiAsyncBackendJobExecutionActor$.StandardException(PipelinesApiAsyncBackendJobExecutionActor.scala:91)

Which seems to be caused by this entry in my yaml: immuno.normal_dna_spike_in_error_rate: "<=0.50% (R1), <=0.75% (R2)"

The same issue would presumably apply to all of the new inputs that allow a user to feed in values describing the data generation for the FDA QC report.
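
The quickest fix is probably to quote the interpolated values in the FDA table generation task so bash treats each one as a single argument, assuming the current command interpolates them bare (the input name here is a guess based on the flag):

From
--spike_in_error_rate ~{spike_in_error_rate} \
To
--spike_in_error_rate "~{spike_in_error_rate}" \

A more thorough fix would validate or sanitize these free-text inputs before they ever reach a command line.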

Out of resources in FDA unaligned metrics step

In a recent test run I got four failures (first attempt and all three retries) on this:
generateFdaMetrics -> call-unaligned_tumor_dna_fda_metrics -> generateFdaMetricsForBamOrFastqs -> call-unalignedSeqFdaStats

In all four failures the stdout and stderr files are empty. The log file shows that localizing the files worked and then Cromwell started to run /cromwell_root/script, but nothing after that.

I logged into an instance and it seemed to be running the Perl code, using most of the disk space, and memory usage was growing. Maybe it runs out of either memory or disk on this step for larger input files. We have had plenty of successes for this step on other inputs.

I think the Perl code in question is this stuff:

        #!/usr/bin/perl

        use strict;
        use warnings;
        use Carp;

        use FileHandle;
        use File::Basename;
        use File::Spec::Functions;
        use IO::Uncompress::AnyUncompress;


        # sets global variables with the default
        use vars qw/$filter $offset $samtools/;

        # defines the minimum base call quality score to filter
        # inclusive, for example, Bases with >= Q30
        $filter = 30;

        # defines the offset to calculate the base call quality score from the Phred +33 encoded by ASCII characters
        # For example, 33 = 63.06707094 - 30.06707094 (see below)
        $offset = 33;

        # specifies the program paths in docker
        $samtools = "/opt/samtools/bin/samtools";


        # main subroutine
        Main();

        # program exits here
        exit 0;


        sub Main {
            # for the paths of input files
            my @paths = @ARGV if @ARGV > 0;
            croak "input file path required" unless @paths > 0;

...

I think this is the WDL:
https://github.com/wustl-oncology/analysis-wdls/blob/main/definitions/tools/unaligned_seq_fda_stats.wdl

I wonder if we might bump the memory a bit there? Conditionally? Or perhaps disk space a bit more?
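
A sketch of size-based sizing for that task, following the disk-sizing pattern already used in tools like mark_duplicates_and_sort.wdl (the sequence_files input name and the constants are placeholders):

    Int space_needed_gb = 10 + round(3 * size(sequence_files, "GB"))
    Int memory_gb = 8 + round(size(sequence_files, "GB") / 2)

    runtime {
      # existing docker setting unchanged
      memory: "~{memory_gb}GB"
      disks: "local-disk ~{space_needed_gb} SSD"
    }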

I also wonder if it would be possible to add some kind of progress logging output, so that we could see how this long-running task is going and where it gets to before failing in situations like this.

I wish I had the resource monitoring script turned on for this test. That might have told us what exactly is going on here.

Gracefully handle partial optitype HLA typing

This issue was discovered using a downsampled input data set.

If Optitype is unable to type an HLA gene, it reports a blank entry for that column.

This causes a failure in the extract HLA allele step of the immuno.wdl workflow. This step expects all six HLA allele entries to be present. But if there isn't enough data to call one or more, optitype still completes successfully and reports what it can. This seems reasonable and we should handle this case and proceed with whatever HLA alleles we get.

We believe the problematic line is this awk here:
https://github.com/griffithlab/analysis-wdls/blob/513d16782c93acbdda23949e03c17035128613ca/definitions/tools/extract_hla_alleles.wdl#L17
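
A sketch of a more tolerant extraction, assuming the usual OptiType result layout (a header row, then a single data row with the class I calls in columns 2-7); empty columns are skipped instead of being emitted as blank alleles, and whether the HLA- prefix belongs here or downstream depends on the existing task:

    command <<<
      set -o errexit
      awk -F'\t' 'NR == 2 { for (i = 2; i <= 7; i++) if ($i != "") print "HLA-"$i }' ~{optitype_tsv}
    >>>

Downstream steps (hla_consensus, the pvacseq allele inputs) would also need to tolerate fewer than six alleles.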

sambamba-sort error: not enough data in stream

Recently encountered a case that hit this error.

sambamba-sort: Error reading BGZF block starting from offset 0: stream error: not enough data in stream

In this step:
call-somaticExome -> somaticExome -> call-normalAlignment -> sequenceToBqsr -> call-markDuplicatesAndSort

Notes:

  • Error was seen in all three reattempts
  • The input FASTQ files to the workflow seem okay
  • The input BAM from the previous step seems okay

Even though the input BAM in this case is actually a smallish normal sample BAM, it's possible that the space required relative to its size is large and we are running out of disk space?

Disk space needed is currently calculated as follows:
https://github.com/wustl-oncology/analysis-wdls/blob/5c745330ef128404f0b94e819813c28eaf727193/definitions/tools/mark_duplicates_and_sort.wdl#LL[…]C7

We could try increasing the multiplier (which adds cost), or maybe just increase the base amount:

e.g.
From
Int space_needed_gb = 10 + round(5*size(bam, "GB"))
To
Int space_needed_gb = 20 + round(5*size(bam, "GB"))

Relax star fusion detection parameters for filtering

The default parameters may be a bit too strict for lower-purity tumors. In general we have not been seeing very large sets of fusion calls for most of our samples. I think we could relax our filtering a bit and then use FusionInspector and manual review to watch out for false positives.

Candidates for update:

  • --min_FFPM. The default of 0.1 resulted in a known EML4-ALK fusion being missed even though it seemed to pass all other criteria. Reduce to 0.05 or even 0.025?
  • --min_alt_pct_junction. Default is 10%

Move specification of preemptible and maxRetries into runtime blocks of each task

Currently we specify a few runtime attributes as global defaults (which is convenient). Apparently Terra does not support this notion and it would also be nice to have more granular control anyway.

Right now we are doing this as a default:
/shared/cromwell/workflow_options.json

    "default_runtime_attributes": {
        "preemptible": 1,
        "maxRetries": 2
    },

These defaults are loaded as each workflow is launched on the standalone Cromwell server. However, to work with Terra we would want these to be specified in the runtime block of each task. That would also allow certain steps to be configured differently with regard to preemptible choice or number of retries.
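
Concretely, each task's runtime block would carry the attributes itself, for example:

    runtime {
      # existing docker, cpu, memory and disks settings unchanged
      preemptible: 1
      maxRetries: 2
    }

That also opens the door to setting preemptible: 0 on long-running steps (such as mark_duplicates_and_sort) that currently suffer from preemption.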

Investigate intersection/annotation of "known variants" when a variant VCF from CLE for example is supplied

For some runs it is desirable to annotate variant results from the pipeline with those that were also called in another pipeline. For example, these could be variants from a CLIA/CAP pipeline, and checking pipeline-called variants against this "known" or "validated" set may be required for compliance.

Currently in immuno.wdl you can supply validated variants as an input in the form of an indexed VCF

File? validated_variants
File? validated_variants_tbi

The validated_variants input gets passed to somatic_exome.wdl, then to detect_variants.wdl, then to subworkflows/filter_vcf.wdl, and finally to tools/filter_known_variants.wdl.

filter_known_variants uses a series of bcftools commands with the apparent intent of annotating variants with VALIDATED.

My first question was: is this actually happening? The answer appears to be yes. Our final VCFs include various annotations, with the last three looking something like this: CSQ=;VALIDATED;set=mutect-strelka

My first thought on this is that it does not seem to follow VCF conventions. It seems to me that it would be more correct to define a new tag (e.g. known_variant_status), define it in the VCF header, and then give it a value of VALIDATED or NOT_FOUND.

It seems that while these annotations are present in the final VCF, being validated is NOT required for a variant to be in that VCF. In this example /storage1/fs1/gillandersw/Active/Project_0001_Clinical_Trials/pancreas_leidos/analysis/TWJF-5120-02_BGA-2103/gcp_immuno/final_results/annotated.expression.vcf.gz there are:

  • 45 variants that are VALIDATED
  • 61 that do not include this string, 36 of which are marked PASS. I believe these are being carried forward to neoantigen prediction. Maybe that is what we actually want to do, but we need to discuss.

While reviewing this I noticed that immuno.wdl also has this step that takes the final variants from somatic exome as input:

call ikv.intersectKnownVariants as intersectPassingVariants {
input:
vcf=somaticExome.final_filtered_vcf,
vcf_tbi=somaticExome.final_filtered_vcf_tbi
}

These lines invoke tools/intersect_known_variants.wdl.

This sounds like the step that intends to actually remove the non-validated variants if the user supplied such a list.

My interpretation of this code is that it takes the validated VCF supplied by the user, pulls out the variants that are PASS, and then intersects this list with the PASS variants in the pipeline VCF. This is not happening.

When this tool intersect_known_variants.wdl is called from immuno.wdl, don't we need to pass in validated_variants and validated_variants_tbi?? Unless those are defined, it seems the intersection step is skipped and the input VCF is merely copied.
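
If so, the call would presumably become something like the sketch below (the exact input names expected by intersect_known_variants.wdl would need to be checked):

    call ikv.intersectKnownVariants as intersectPassingVariants {
    input:
    vcf=somaticExome.final_filtered_vcf,
    vcf_tbi=somaticExome.final_filtered_vcf_tbi,
    validated_variants=validated_variants,
    validated_variants_tbi=validated_variants_tbi
    }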

Runs failing at RNA_star_fusion's AgFusion step when no fusions are found.

Hi, my runs failed during the RNA star fusion's AGFusion step with the following error:

2022-09-30 04:46:01,060 - AGFusion - WARNING - Output directory agfusion_results already exists! Overwriting... WARNING:AGFusion:Output directory agfusion_results already exists! Overwriting... 2022-09-30 04:46:01,061 - AGFusion - ERROR - Read 0 fusions from the file! Exiting... ERROR:AGFusion:Read 0 fusions from the file! Exiting...

We checked the RNA star fusion detect outputs and confirmed that the TSV generated there was in fact empty (it just had a header line).

While having 0 fusions detected is unlikely (and may have happened here because I had the wrong RNA strand settings), Malachi and I figured that this might still be a bug that needs to be addressed.

The links for the runs are here:
https://console.cloud.google.com/storage/browser/griffith-lab-test-kartik/cromwell-executions/immuno/797098d6-5d63-43e8-b96d-77bba2336e17
https://console.cloud.google.com/storage/browser/griffith-lab-test-kartik/cromwell-executions/immuno/ab45613e-a82c-4e3a-9e7c-d849fd1ae206
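
One possible way to handle this gracefully would be to guard the AGFusion invocation inside the agfusion task's command block (a WDL-level conditional around the call would be another option); a sketch assuming the STAR-Fusion predictions TSV arrives as fusion_predictions:

    command <<<
      set -o errexit
      mkdir -p agfusion_results
      # skip AGFusion entirely when STAR-Fusion reported no fusions (header-only TSV),
      # leaving an empty results directory for downstream steps to treat as zero fusions
      if [ "$(tail -n +2 ~{fusion_predictions} | wc -l)" -eq 0 ]; then
        exit 0
      fi
      # ... existing AGFusion invocation continues below, unchanged
    >>>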

Minor updates to output naming and organization for immuno.wdl

A few small things that would make the final results of immuno a bit tidier:

  • At the top level of the results dir we have a file: pvacseq.annotated.tsv. I think it would be more accurate to name this simply variants.final.annotated.tsv. The variants aren't really coming from pvacseq; even though they come out of the pvacseq sub-workflow, nothing in that TSV is a pvacseq-related value.
  • At the top level. Rename pvactools -> pvacseq
  • At the top level. Rename pvacfuse_predictions -> pvacfuse
  • At the top level. Move fusioninspector_evidence to be inside the rnaseq dir
  • At the top level. Move fda_metrics to be inside the qc dir

Optimize CPU/RAM requests

Several related issues here:

  1. if you create a custom N1 VM, it can have up to 6.5 GB of memory per vCPU. So, for example, when we select 32GB for the mutect step, this forces us to request 6 CPUs. A list of places where we've been bumped up to more cores is here:
    /storage1/fs1/mgriffit/Active/griffithlab/pipeline_test/gcp_wdl_test/saved_results/final_results_v1_fusions_ens95/workflow_artifacts/extra_cpu_requests.txt

  2. WGS runs require more resources than exomes in many cases, but our memory/CPU values are set so that the largest data sets will run. Either a) provide a parameter that allows specifying WGS or exome at the top level, or b) use the BAM size directly to estimate memory usage in some of these steps.

Metrics files

Right now, there are two metrics files that aren't getting spit out in the final results from the immuno pipeline.

  1. we're not generating a flagstat for the RNAseq bam (or if we are, we're not outputting it)
  2. we're not keeping the markduplicates metrics file.

Both should be easy enough to add and dump out.

Out of disk error from FDA reports workflow

I encountered a failure at this step in a recent test:
immuno -> call-generateFdaMetrics -> call-aligned_tumor_dna_cram_to_bam

Encountered this error:

2023/03/12 15:46:32 Running user action: docker run -v /mnt/local-disk:/cromwell_root --entrypoint=/bin/bash quay.io/biocontainers/samtools@sha256:141120f19f849b79e05ae2fac981383988445c373b8b5db7f3dd221179af382b /cromwell_root/script
[E::bgzf_flush] File write failed (wrong size)
samtools view: writing to "tumor.bam" failed: No space left on device
[E::bgzf_close] File write failed
samtools view: error closing "tumor.bam": -1
/cromwell_root/script: line 25: echo: write error: No space left on device
tee: /cromwell_root/stderr: I/O error

The call is here:

call cram_to_bam.cramToBam as aligned_tumor_dna_cram_to_bam {
input:
reference = reference,
reference_index = reference_index,
reference_dict = reference_dict,
cram = aligned_tumor_dna,
cram_index = aligned_tumor_dna_index
}

Which calls a task that defines its space needs like this:

Int space_needed_gb = 10 + round(size([cram, cram_index, reference, reference_index, reference_dict], "GB"))

It does seem like this is likely to be insufficient for larger input files. The space request is based on the size of the CRAM with no multiplier + 10Gb. But the expanded BAM can be much larger than the CRAM. So I think we want some kind of multiplier here.

Going to try this:
Int space_needed_gb = 10 + round(size([cram, cram_index, reference, reference_index, reference_dict], "GB") * 2)
