nyu-molecular-pathology / ngs580-nf

Target exome sequencing analysis for NYU NGS580 gene panel

License: GNU General Public License v3.0

Nextflow 48.64% Makefile 6.53% TeX 1.50% Shell 1.76% Python 21.86% R 11.08% Roff 3.30% Ruby 0.12% CSS 0.26% Dockerfile 4.94%

docker exome exome-sequencing-analysis nextflow pipeline singularity

ngs580-nf's People

stevekm maurya-anand javrodriguez jtpoirier cgpu w28924461701

ngs580-nf's Issues

LoFreq Somatic output include dbSNP variants

LoFreq Somatic variant calling outputs SNVs and indels both with and without dbSNP variants. Need to update the pipeline to use the output that includes dbSNP variants; do not exclude dbSNP variants.

Sometimes values loaded from JSON are not available inside processes

See issue here: nextflow-io/nextflow#1342

Need to make sure that all required values are copied out of the parsed JSON object into new object(s), and that those new objects are used throughout the pipeline instead.

need to refactor reference file staging methods

In conjunction with #9, need to consider alternative staging methods for reference files, especially in cases where a path to an entire directory is passed, such as for ANNOVAR databases and some genome.fa files. Files should be staged directly, instead of staging the entire directory or passing just the directory path.
Stage-in modes such as 'copy' could be combined with 'scratchDir' for a potential speedup from processing files directly out of HPC node NVMe SSD space with reduced GPFS overhead. Also consider stage-in via RAM disk, especially for items like ANNOVAR databases, though this may not be feasible given their extremely large storage requirements. Might need to split ANNOVAR annotation into separate processes for each database, stage each db individually, then recombine.

Need to refactor channel `create` method

Need to go through the pipeline and update sections that use the create method, due to messages from Nextflow:

WARN: The channel `create` method is deprecated -- it will be removed in a future release

Will probably be needed as we migrate to newer versions of Nextflow to utilize new features.

Add subdirs to publishDirs for .bam files

Right now all the .bam files are published to a single directory; need to add subdirs based on the type of .bam file.

docker build error

Hi!

Thank you for your code (NGS580-nf).

When I try to build the Docker containers, I get the error below.

$ git clone https://github.com/NYU-Molecular-Pathology/NGS580-nf.git
$ cd NGS580-nf/containers
$ make build-all-Docker

How can I solve this issue? Please take a look.

Update QC report metrics

Need to update the analysis-wide QC report to better reflect quality metrics, include descriptions, describe & visualize cutoffs & thresholds, etc

add FACETS to pipeline

need to add FACETS analysis step to pipeline for tumor purity analysis

https://github.com/mskcc/facets

https://github.com/mskcc/facets/blob/master/vignettes/FACETS.pdf

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5027494/

https://anaconda.org/bioconda/r-facets

copy samplesheet to output dir

Need a copy of the samplesheet that was used to be placed in the output directory, so that it is kept for records when scripts sync only the output.

need bash trap for .nextflow.submitted

Need to update the Makefile submit recipe to use a bash trap to catch kill commands and automatically remove the .nextflow.submitted lock file. Otherwise you have to remember to delete .nextflow.submitted every time you use make kill, which gets annoying.
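For reference, the cleanup-on-kill idea can be sketched in Python. This is an analogue of a bash trap, not the Makefile change itself; the names `submit_lock` and `Terminated` are hypothetical and only the lock file name comes from the issue.

```python
import os
import signal
import threading
from contextlib import contextmanager


class Terminated(Exception):
    """Raised when the process receives SIGTERM (hypothetical helper)."""


def _raise_terminated(signum, frame):
    raise Terminated(f"received signal {signum}")


@contextmanager
def submit_lock(path=".nextflow.submitted"):
    """Create the lock file and remove it on normal exit, error, or SIGTERM."""
    in_main = threading.current_thread() is threading.main_thread()
    previous = None
    if in_main:
        # Signal handlers can only be installed from the main thread.
        previous = signal.signal(signal.SIGTERM, _raise_terminated)
    open(path, "w").close()
    try:
        yield path
    finally:
        # Runs whether the body finished, raised, or was interrupted.
        if os.path.exists(path):
            os.remove(path)
        if in_main:
            restore = previous if previous is not None else signal.SIG_DFL
            signal.signal(signal.SIGTERM, restore)
```

Wrapping the submission step in `with submit_lock():` would mean the lock file is gone after a kill, provided the kill is delivered as SIGTERM; SIGKILL cannot be caught by any handler.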

Remove 'done' channels

Many empty channels such as "done_copy_samplesheet" were previously used to force the custom reporting steps to wait until the end of the pipeline before running. This is no longer necessary; these channels should be removed.

Add configs for different targets.bed files

Need to add an easy interface to specify which targets.bed file to use, from a selection of several included choices. Need to update the analysis report to clearly state the filename and targets file choice in order to differentiate between e.g. 580, 629, and future targets.

Need scope writeup

Need writeup of scope for pipeline output; metrics, plots, reports, etc., required to be output by pipeline

Need to update TMB calculation methods for pairs, unpaired samples

Need karyotype plot

Use a karyotype plot for:

target regions & genes
coverage at target regions per sample

sometimes custom analysis report does not run

Seems like there are edge cases where some channels do not produce the output that the sample and custom analysis report processes require in order to run.

re-setup vaf distribution plot

The vaf_distribution_plot process was disabled; need to re-enable it and include it in the report outputs.

Fix failed tumor-normal comparison log in report

The table for the failed comparison log is out of order: the header row is being pushed further down the log instead of coming out on top. Need to update the logging channel to guarantee that this row is first.

split bam_ra_rc_gatk process

Consider splitting the bam_ra_rc_gatk process into multiple processes; this step can take a very long time to complete, possibly exceeding SLURM partition time limits in the event of system performance issues.

need to update ref download methods

Consider incorporating the reference files downloads directly into the main pipeline instead of using separate 'ref.nf' and 'annovar_db.nf' pipelines for them. This would reduce pipeline management complexity and complexity for the end-user, at the expense of a slightly more complicated pipeline in 'main.nf'

Use full git tag in output tables

Instead of listing the most recent tag, the full current tag should be used since it contains the git commit, recent tag, and number of commits from the tag all in one.
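A minimal sketch of retrieving and splitting such a description, assuming it comes from `git describe --tags --long` (output shaped like `TAG-N-gHASH`). The function names are hypothetical, not part of the pipeline.

```python
import subprocess


def parse_git_describe(described):
    """Split 'TAG-N-gHASH' into (tag, commits_since_tag, short_hash)."""
    # rsplit keeps hyphens inside the tag name intact.
    tag, count, ghash = described.strip().rsplit("-", 2)
    if not ghash.startswith("g") or not count.isdigit():
        raise ValueError(f"unexpected git describe output: {described!r}")
    return tag, int(count), ghash[1:]


def current_git_describe(repo_dir="."):
    """Return the long description of HEAD for the repo at repo_dir."""
    result = subprocess.run(
        ["git", "describe", "--tags", "--long"],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()
```

The unparsed string from `current_git_describe()` is what would go into the output tables; the parser is only there in case the parts are needed separately.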

masked nucleotides showing up in vcf files, breaking merge script

Lofreq variant caller seems to retain some of the lower-case masked nucleotides in the variants output in its .vcf files. These lower-case nucleotides end up getting converted to upper-case when running the .vcf through GATK VariantsToTable, but they are retained as lower-case when annotating the .vcf with ANNOVAR. This leads to errors when trying to merge the ANNOVAR annotation output with the .vcf tsv file from GATK, since the columns no longer match.

Consider forcing all nucleotides to upper case in the merge script. Or resolve this further upstream.
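A sketch of the upper-casing option, assuming the merge script reads both tables as dict rows (e.g. via `csv.DictReader`). The column names `Ref` and `Alt` are assumptions and would need to match the actual headers in each table.

```python
def uppercase_alleles(rows, columns=("Ref", "Alt")):
    """Yield copies of each row with the allele columns forced to upper case.

    `columns` is an assumption about the table headers; other columns are
    passed through unchanged.
    """
    for row in rows:
        fixed = dict(row)
        for col in columns:
            if fixed.get(col) is not None:
                fixed[col] = fixed[col].upper()
        yield fixed
```

Applying this to both the ANNOVAR and the GATK table before building merge keys should make soft-masked bases such as `a` and `A` compare equal.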

Need to adjust memory allocation for jobs

Default memory allocation for many SLURM jobs on NYU Big Purple HPC was recently reduced to 4GB. However, Nextflow reports suggest that for many jobs, this could be reduced further, which might help to increase job throughput.

Need to run some more evaluations of total memory usage for jobs and try to find parameters closer to memory minimum required for most jobs using typical datasets.

Evaluate Nextflow Tower integration

Need to refactor LoFreq and HaplotypeCaller to use separate filter and tsv processes

use the same processes as the new VarScan2 steps

Make 'no lane split' default for samplesheet via deployment setup

Add CNVKit targets annotation file creation method

Need to add an easy scripted method to create the refFlat targets annotation file for a new targets.bed file. Consider using a separate Nextflow pipeline, since we will need to incorporate a Singularity container to get CNVKit loaded.

Need to support deployment with multiple fastq directories

Corresponds with NYU-Molecular-Pathology/demux-nf#8, need to be able to deploy a new pipeline with more than one fastq dir

Need to run genomic signatures for all variant callers

rename karyotype plots

Karyotype plot needs to be renamed to chromosome ideogram plot

remove process lofreq_filter_reformat from config

Need to remove the references to process lofreq_filter_reformat from the nextflow.config file, gives errors like this:

WARN: There's no process matching config selector: lofreq_filter_reformat

Need to refactor variable name in cnvkit process

cnvkit process includes variable ref_fai13, need to rename this and other ref fasta variables to ref_fai, etc.

Need to refactor pairs channels to include 'callerType'

Currently, LoFreqSomatic implements a "callerType" value, whereas MuTect2 implements "chunkLabel". LoFreqSomatic's "callerType" ends up getting passed in to the "chunkLabel" channel variable. This causes problems where it's not possible to differentiate between MuTect2 chunk labels and LoFreqSomatic caller types downstream, including the latter being absent from the final annotation table.

Need to refactor the output channels for paired steps to include both "callerType" and "chunkLabel", and refactor all downstream processes that use them with this new cardinality.

Need output validator

Need to develop a validation method for the pipeline output, in order to determine if all required outputs are present. Consider developing this in Python, possibly as a unittest module, to be run upon pipeline completion.

Updated samplesheets do not seem to always work correctly with 'resume'

See comments here : nextflow-io/nextflow#1342

Consider using a Nextflow process to retrieve the inputs from the samplesheet. This might work better, since processes always re-run if their input (the samplesheet) changes; it is not clear whether Nextflow will always detect changes to external items such as samplesheets on 'resume'.

Need to include `finalize-work-rm` in with default `run` Makefile recipe

It is not useful to retain Nextflow work subdirs from old pipeline runs that are no longer being used, so the make finalize-work-rm recipe should be run by default upon the successful completion of a pipeline under either the make run or Big Purple specific recipes. This will greatly reduce the amount of space used without compromising the 'resume' functionality. However, this recipe can take a long time to complete and produces a lot of stdout messages, so its implementation needs to avoid bloating the tail of the run log, since that's the primary way to easily tell whether the pipeline succeeded.

Need writeup for CNV Pool reference file generation workflow

Make sure we have a .md file or other documentation for the CNV pool reference file usage and generation steps. See the HapMap Pool .md file for an example.
Also include an example copy of the samplesheet in the example directory.

Need to update coverage cutoffs

Coverage cutoffs need to be changed to:

50x, 100x, 200x, 300x, 400x, 500x
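For illustration only, one way these cutoffs might be applied is to report the fraction of target positions at or above each depth. How the pipeline actually consumes the cutoffs is not stated here, and the function name is hypothetical.

```python
# Cutoff values taken from the list above.
COVERAGE_CUTOFFS = (50, 100, 200, 300, 400, 500)


def fraction_at_or_above(depths, cutoffs=COVERAGE_CUTOFFS):
    """Map each cutoff to the fraction of positions with depth >= cutoff."""
    depths = list(depths)
    if not depths:
        return {cutoff: 0.0 for cutoff in cutoffs}
    return {
        cutoff: sum(depth >= cutoff for depth in depths) / len(depths)
        for cutoff in cutoffs
    }
```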

Refactor variant calling processes to use separate normalize_vcf steps

Right now most variant calling processes in the pipeline immediately normalize vcf files internally with bcftools using their own dedicated methods. Need to decouple these into the separate normalize_vcfs_pairs and normalize_vcfs tasks.

R installed in home dir breaks Singularity containers

We recently had errors with the deconstructSigs Singularity container deconstructSigs-1.8.0.simg:

Loading required package: GenomicRanges
Error: package or namespace load failed for ‘GenomicRanges’ in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]):
 namespace ‘XVector’ 0.18.0 is being loaded, but >= 0.19.8 is required
Error: package ‘GenomicRanges’ could not be loaded
Execution halted

It turns out that R was discovering two installations of the XVector library:

> installed.packages()[c(41, 117),]
        Package   LibPath
XVector "XVector" "/gpfs/home/vasudv02/R/x86_64-pc-linux-gnu-library/3.4"
XVector "XVector" "/conda/lib/R/library"
        Version  Priority
XVector "0.18.0" NA
XVector "0.20.0" NA
        Depends
XVector "R (>= 2.8.0), methods, BiocGenerics (>= 0.19.2), S4Vectors (>=\n0.15.14), IRanges (>= 2.9.18)"
XVector "R (>= 2.8.0), methods, BiocGenerics (>= 0.19.2), S4Vectors (>=\n0.17.24), IRanges (>= 2.13.16)"
        Imports
XVector "methods, zlibbioc, BiocGenerics, S4Vectors, IRanges"
XVector "methods, utils, zlibbioc, BiocGenerics, S4Vectors, IRanges"
        LinkingTo            Suggests                              Enhances
XVector "S4Vectors, IRanges" "Biostrings, drosophila2probe, RUnit" NA
XVector "S4Vectors, IRanges" "Biostrings, drosophila2probe, RUnit" NA
        License        License_is_FOSS License_restricts_use OS_type MD5sum
XVector "Artistic-2.0" NA              NA                    NA      NA
XVector "Artistic-2.0" NA              NA                    NA      NA
        NeedsCompilation Built
XVector "yes"            "3.4.2"
XVector "yes"            "3.4.1"

The version 0.18.0 of the library was in the user home directory /gpfs/home/vasudv02/R/x86_64-pc-linux-gnu-library/3.4. This conflicted with the conda-installed version 0.20.0 that was meant to be used inside the container, causing the error shown.

The solution in this case was to remove the user installation directory for R. However, it is expected that any user with an R installation that happens to coincide with the version of R inside the container might end up with the same problem. We need to figure out a way to prevent user home directory R installed libraries from being used inside Singularity containers. Note that by default, Singularity mounts the user's home directory inside the container, and by default R will look for libraries in the user home directory. So we might need to figure out if it's possible to disable this feature or find some other work-around.

filter for fastq with no reads

Need to add some kind of channel filter or other sanity check to make sure that the fastq files have >0 reads in them.
Had a case where I 'finalize'd the output and then resumed the run; the empty fastq from 'fastq-merge' was used for the pipeline, which immediately broke. Need to prevent such situations.
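A minimal sketch of such a sanity check, assuming it runs as a small helper before the files enter the pipeline. It only confirms that one complete four-line record exists and does not validate the rest of the file; `fastq_has_reads` is a hypothetical name.

```python
import gzip
from itertools import islice


def fastq_has_reads(path):
    """Return True if the file starts with at least one 4-line FASTQ record."""
    # Handles both plain and gzip-compressed files, chosen by extension.
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        record = list(islice(handle, 4))
    return len(record) == 4 and record[0].startswith("@")
```

Files for which this returns False could be dropped from the input channel or made to fail the run early with a clear message.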

resume not working with RealignerTargetCreator step

When trying to resume a completed pipeline, variant calling keeps getting started over again unnecessarily, starting at the RealignerTargetCreator step, followed by the IndelRealigner step. Need to figure out why these steps are not 'resume'ing properly; it's causing large delays in the re-processing and updating of old runs.

Errors in `reformat-vcf-table.py` return exit code 0

When an error is encountered in the reformat-vcf-table.py script while its stdout is being piped, the overall exit code of the process is returned as 0, allowing broken files to pass through the Nextflow pipelines. Example:

reformat-vcf-table.py -c MuTect2 -s "${tumorID}" -i "${tsv_file}" | \
        paste-col.py --header "Sample" -v "${tumorID}"  > output.tsv

If an error is encountered by reformat-vcf-table.py here while it is in the middle of processing the file contents, a partial file will be created in output.tsv and the returned exit code of 0 will allow Nextflow to propagate output.tsv further down the pipeline.

Need to figure out how to ensure that a non-zero exit code is returned in these cases.
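One candidate fix is bash's `pipefail` option, which makes a pipeline return the status of the last command that failed rather than the status of the final command. The sketch below just demonstrates the difference from Python and assumes `bash` is on the PATH; `run_pipeline` is a hypothetical helper, not part of the repo.

```python
import subprocess


def run_pipeline(command, pipefail=True):
    """Run a shell pipeline under bash and return its exit code."""
    # With pipefail set, a failure in any stage makes the whole pipeline fail.
    script = ("set -o pipefail; " if pipefail else "") + command
    return subprocess.run(["bash", "-c", script]).returncode


# 'false | cat' stands in for a failing script piped into a succeeding one.
status_without = run_pipeline("false | cat", pipefail=False)
status_with = run_pipeline("false | cat", pipefail=True)
```

Here `status_without` is 0 and `status_with` is 1, so adding `set -o pipefail` at the top of the affected process scripts should let Nextflow see the failure.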