The balsamic from clinical-genomics

coverage threshold 5 is missing

BALSAMIC/BALSAMIC/snakemake_rules/quality_control/sambamba_depth.rule

Line 33 in b716ef3

"--cov-threshold {params.cov_4} {input.bam} > {output}; "

Variant calling AF threshold for Vardict as a parameter in config

Sorting data from Picard

The reports produced by Picard should be sorted row-wise.

Tumor mutational burden

1. TMB was defined as the number of somatic, coding, base substitution, and indel mutations per megabase of genome examined. https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-017-0424-2
1.1 Non-coding alterations were not counted.
1.2 Alterations listed as known somatic alterations in COSMIC and truncations in tumor suppressor genes were not counted.
1.3 Alterations predicted to be germline by the somatic-germline-zygosity algorithm were not counted.
1.4 Alterations that were recurrently predicted to be germline in our cohort of clinical specimens were not counted.
1.5 Known germline alterations in dbSNP were not counted.
1.6 To calculate the TMB per megabase, the total number of mutations counted is divided by the size of the coding region of the targeted territory.
1.7 Use Table 1 from the paper above as a reference for comparison.
2. Tumor mutation burden (TMB), fraction of copy number–altered genome, and gene alterations were compared among patients with DCB and no durable benefit (NDB). http://ascopubs.org/doi/full/10.1200/JCO.2017.75.3384
2.1 In addition to the above, copy number alterations were also counted.
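The counting rules in 1.1 to 1.6 could be sketched roughly as below. This is only an illustration, not BALSAMIC code: the `Variant` class and its field names are hypothetical stand-ins for whatever annotation the pipeline ends up providing, and `predicted_germline` lumps rules 1.3 and 1.4 together.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    # Hypothetical annotation flags; real field names will depend on the annotator.
    is_coding: bool
    known_cosmic_somatic: bool = False   # rule 1.2
    tsg_truncation: bool = False         # rule 1.2
    predicted_germline: bool = False     # rules 1.3 and 1.4
    in_dbsnp: bool = False               # rule 1.5


def is_counted(variant: Variant) -> bool:
    """Return True if the variant contributes to TMB under rules 1.1 to 1.5."""
    excluded = (
        variant.known_cosmic_somatic
        or variant.tsg_truncation
        or variant.predicted_germline
        or variant.in_dbsnp
    )
    return variant.is_coding and not excluded


def tmb_per_mb(variants, coding_region_bp: int) -> float:
    """Rule 1.6: counted mutations divided by the targeted coding region in Mb."""
    if coding_region_bp <= 0:
        raise ValueError("coding_region_bp must be positive")
    counted = sum(1 for v in variants if is_counted(v))
    return counted / (coding_region_bp / 1_000_000)
```

For the second definition (2.1), copy number alterations would have to be added to the count before the division.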

CollectAlignmentSummaryMetrics

An additional rule for picard CollectAlignmentSummaryMetrics

UMI support for BALSAMIC

Literature review
A list of aligners
Variant callers
QC and error correction

Change working directory for job submission

Set the working directory for job submissions with the sbatch option -D, --workdir=directory.

Manta single sample mode

Manta in single sample mode should only have the following files in their output:

"tumorSV.vcf.gz", "candidateSV.vcf.gz", "candidateSmallIndels.vcf.gz"

while right now it is:

"diploidSV.vcf.gz", "somaticSV.vcf.gz", "candidateSV.vcf.gz", "candidateSmallIndels.vcf.gz"

CollectInsertSizeMetrics

Include the CollectInsertSizeMetrics results from Picard in the results.

Result report filter config

Result report filter config is a series of filters, configs, and parameters that summarize analysis results in order to present a list of actionable targets and newly discovered targets.

This aggregated list will generate four lists of variants for SNVs and INDELs:

High-confidence set from an exact replica of the MSK-IMPACT pipeline; see FDA approved figure 1.
Low-confidence set from an exact replica of the MSK-IMPACT pipeline; see FDA approved figure 1.
High-confidence set of variants: identified by at least 1 variant caller, AD>=1, VD>=1, and in MVL.
Low-confidence set of variants: these variants are not in MVL, thus low confidence. Instead, a set of variant-caller-specific filters will be applied according to recent research and clinical findings.

Add flag for mark duplicate in the config file

Coverage report sometimes returns 0 coverage for all genes

Remove/mark duplicate as a flag in config

create_config doesn't create config file properly if path does not exist.

Send an email when the workflow finishes

Remove PASS filter from Strelka

BALSAMIC/BALSAMIC/snakemake_rules/variant_calling/strelka.rule

Line 39 in 9df51b2

    
           "bcftools view -Oz -f .,PASS {params.tmpdir}/results/variants/somatic.indels.vcf.gz > {params.result_dir}somatic.indels.vcf.gz; "

Look into a solution for complex variants

Vardict reports a bunch of complex variants that need to be either decomposed or put into a separate batch of variants.

Add more PCT_TARGET_BASES steps

Add steps up to 1000X instead of the current 200X.

After 200X, it should be increased by 50X (e.g. 250, 300, etc) to make it possible to plot the data
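A minimal sketch of how the extended threshold list could be generated. The function name and the example list of existing thresholds are assumptions for illustration; only the 50X step and the 1000X ceiling come from this issue.

```python
def extend_coverage_thresholds(current, step=50, maximum=1000):
    """Append thresholds above the current maximum in fixed steps.

    `current` is the list of thresholds already in use (hypothetical here);
    new values start one step above its largest entry and stop at `maximum`.
    """
    if not current:
        raise ValueError("current thresholds must not be empty")
    start = max(current) + step
    return sorted(current) + list(range(start, maximum + 1, step))


# Example with an assumed existing list that ends at 200X.
thresholds = extend_coverage_thresholds([50, 100, 150, 200])
```

Here `thresholds` continues 250, 300, and so on up to and including 1000.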

BALSAMIC extension for statusDB

A list of metrics to export from pipeline
A list of metrics from lab side
Table structure matching with statusDB

Move stargazer, data.table, and optparser to main env with balsamic root

data.table
optparser
stargazer

Single sample mode workflow

Single sample mode workflow on top of #20 and #19
Single sample mode CLI

Remove strelka from json config file for single sample

new log and script path should be created for re-runs

Summary statistics computed on coverage values per exon normalized by per-sample coverage

QC metrics after removing duplicates

Structural variant visualization

A new tool for plotting structural variants: samplot.py: https://github.com/ryanlayer/samplot

Single sample mode non-varcaller components

Contest
Annotation

CollectMultipleMetrics

CollectMultipleMetrics can gather multiple metrics, but it might fail on some of them for various reasons (e.g. Java memory issues). Prepare a list of the metrics we are interested in.

Observed vs Expected variant frequency given a set of individual samples (N>=10)

Describe what needs to be done, similar to figure 2 in this document: https://www.accessdata.fda.gov/cdrh_docs/reviews/den170058.pdf

Merge caller VCF - paired mode

VCF union of all three callers:

The aggregation should include AD and DP
Caller type
PASSed variants only

Single sample mode variant caller rules

Vardict
Strelka
Manta
Mutect

Reporting actual DP and AD in Mutect2

It seems that the AF reported by Mutect2 in gatk3.8 does not match the actual AD values. This is because the AD shown is the "unfiltered" AD, and the read depth values are only estimates made "before" the internal filters are applied (broadinstitute/gatk#3808). DP is therefore not reported by Mutect2 in gatk3.8 (it is reported in the gatk4 version, where the AD values also do not match those from gatk3.8). This could be due to the internal filtering that happens behind the scenes.

One way to get the values used by Mutect2 and add the actual AD to the VCF is to invoke the Coverage annotation through the --annotation option. Note that the DP value reported here will still be the unfiltered reads of tumor + normal. For the actual DP value, the AD values need to be summed.
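Summing AD to get the actual depth could look like the sketch below. It assumes the usual VCF convention of a comma-separated AD field with the reference count first; the function name is made up for this example, and missing values (".") are treated as zero.

```python
def depth_and_af_from_ad(ad_field: str):
    """Derive DP and AF from a sample's AD string, e.g. "90,10".

    DP is the sum of all allele depths; AF is the alternate fraction of that sum.
    """
    counts = [0 if value == "." else int(value) for value in ad_field.split(",")]
    depth = sum(counts)
    alt_depth = sum(counts[1:])  # everything after the reference allele
    allele_fraction = alt_depth / depth if depth else 0.0
    return depth, allele_fraction


depth, af = depth_and_af_from_ad("90,10")
```

With the AD string "90,10" this gives a depth of 100 and an AF of 0.1.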

Add plot functionality to coverage report script

Coverage vs. genomic coordinate
Single plot with gene vs. coverage for each report config section (read through MSK MVL list)
Normalized sample coverage

os.makedirs(os.path.dirname(json_out), exist_ok=True)
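A rough sketch of how that call could sit inside the config writer. `write_config` is a hypothetical helper, not the existing create_config code; the guard is there because `os.path.dirname` returns an empty string for a bare filename and `os.makedirs("")` raises an error.

```python
import json
import os


def write_config(config: dict, json_out: str) -> str:
    """Write `config` as JSON to `json_out`, creating parent directories first."""
    out_dir = os.path.dirname(json_out)
    if out_dir:
        # exist_ok=True makes this a no-op when the directory is already there.
        os.makedirs(out_dir, exist_ok=True)
    with open(json_out, "w") as handle:
        json.dump(config, handle, indent=4)
    return json_out
```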

shit -> shift

BALSAMIC/BALSAMIC/config/MSK_impact.json

Line 22 in 2d501c7

"INDEL": {"frameshift_variant", "frameshift", "non-frameshit"}

Update description for fastq tumor and normal
Check whether fastq files exist

Installation

In accordance with https://github.com/Clinical-Genomics/Goals/issues/51

Prepare bed file for analysis

BED files need interval preparation and runtime estimation. A separate mini-analysis to prepare and split BED files can speed up the analysis and reduce runtime.
