nyu-molecular-pathology / ngs580-nf

Target exome sequencing analysis for NYU NGS580 gene panel

License: GNU General Public License v3.0

Nextflow 48.64% Makefile 6.53% TeX 1.50% Shell 1.76% Python 21.86% R 11.08% Roff 3.30% Ruby 0.12% CSS 0.26% Dockerfile 4.94%
docker exome exome-sequencing-analysis nextflow pipeline singularity

ngs580-nf's People

Contributors

stevekm, varshini712

ngs580-nf's Issues

LoFreq Somatic output include dbSNP variants

LoFreq Somatic variant calling outputs SNVs and indels both with and without dbSNP variants. The pipeline needs to be updated to use the output that includes dbSNP variants; do not exclude them.

need to refactor reference file staging methods

In conjunction with #9, we need to consider alternative staging methods for reference files, especially where a path to an entire directory is passed, such as for ANNOVAR databases and some genome.fa files. Files should be staged directly, instead of staging the entire directory or passing only the directory path.
Stage-in modes such as 'copy' could be combined with 'scratchDir' for a potential speedup from processing files directly out of HPC node NVMe SSD space with reduced GPFS overhead. Also consider stage-in via RAM disk, especially for items like ANNOVAR databases, although this may not be feasible given their extremely large storage requirements. We might need to split ANNOVAR annotation into separate processes for each database, stage each database individually, and then recombine the results.

Need to refactor channel `create` method

Need to go through the pipeline and update sections that use the create method, due to messages from Nextflow:

WARN: The channel `create` method is deprecated -- it will be removed in a future release

Will probably be needed as we migrate to newer versions of Nextflow to utilize new features.

docker build error

Hi!

Thank you for your code (NGS580-nf).

When I try to build the Docker containers, I get the error below.

$ git clone https://github.com/NYU-Molecular-Pathology/NGS580-nf.git
$ cd NGS580-nf/containers
$ make build-all-Docker

[Screenshot: 2019-01-03 09 34 28]

How can I solve this issue? Please check it.

Update QC report metrics

Need to update the analysis-wide QC report to better reflect quality metrics, include descriptions, describe & visualize cutoffs & thresholds, etc

need bash trap for .nextflow.submitted

Need to update the Makefile submit recipe to use a bash trap to catch kill commands and automatically remove the .nextflow.submitted lock file. Otherwise you always have to remember to delete .nextflow.submitted every time you use make kill, which gets annoying.
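
A minimal sketch of the trap idea, assuming the submit recipe runs its commands inside a single bash invocation. The temporary directory and the self-sent kill are stand-ins for the real run directory and for make kill; only the .nextflow.submitted file name comes from the pipeline.

```shell
#!/usr/bin/env bash
# Hypothetical run directory; the real lock file lives in the pipeline directory.
workdir="$(mktemp -d)"
lockfile="${workdir}/.nextflow.submitted"

# Run the "submission" in a child bash so its traps fire when it is killed.
bash -c '
    lockfile="$1"
    # Remove the lock on any exit from this shell.
    trap "rm -f \"$lockfile\"" EXIT
    # Turn INT/TERM into a normal exit so that the EXIT trap runs.
    trap "exit 130" INT
    trap "exit 143" TERM
    touch "$lockfile"
    # Stand-in for the long-running nextflow command: simulate a kill.
    kill -TERM $$
    sleep 1
' _ "$lockfile" || true

if [ -e "$lockfile" ]; then
    lock_state="present"
else
    lock_state="removed"
fi
echo "lock file ${lock_state} after kill"
rm -rf "$workdir"
```

Note that make runs each recipe line in its own shell by default, so the trap and the command it protects would need to be on the same logical line (or in a wrapper script) for this to work.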

Remove 'done' channels

Many empty channels such as "done_copy_samplesheet" were previously used to force the custom reporting steps to wait until the end of the pipeline before running. This is no longer necessary, these channels should be removed.

Add configs for different targets.bed files

Need to add an easy interface to specify which targets.bed file to use, from a selection of several included choices. Need to update the analysis report to clearly state the filename and targets file choice in order to differentiate between e.g. 580, 629, and future targets.

Need scope writeup

Need writeup of scope for pipeline output; metrics, plots, reports, etc., required to be output by pipeline

Need karyotype plot

Use a karyotype plot for:

  • target regions & genes
  • coverage at target regions per sample

Fix failed tumor-normal comparison log in report

The table for the failed comparison log is out of order: the header row is being pushed further down the log instead of appearing at the top. Need to update the logging channel to guarantee that this row comes first.

split bam_ra_rc_gatk process

Consider splitting the bam_ra_rc_gatk process into multiple processes; this step can take a very long time to complete, possibly exceeding SLURM partition time limits in the event of system performance issues.

need to update ref download methods

Consider incorporating the reference files downloads directly into the main pipeline instead of using separate 'ref.nf' and 'annovar_db.nf' pipelines for them. This would reduce pipeline management complexity and complexity for the end-user, at the expense of a slightly more complicated pipeline in 'main.nf'

Use full git tag in output tables

Instead of listing the most recent tag, the full current tag should be used since it contains the git commit, recent tag, and number of commits from the tag all in one.
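
Assuming the "full current tag" means the long form printed by git describe --tags --long (an assumption; the issue does not name the command), the sketch below shows how that single string carries the tag, the commit count, and the abbreviated commit. The example string is made up.

```shell
#!/usr/bin/env bash
# In a repository this string would come from: git describe --tags --long
# The value below is a made-up example, not a real tag of this pipeline.
describe="v1.0-rc2-14-g2414721"

commit="${describe##*-g}"   # text after the last "-g": abbreviated commit hash
rest="${describe%-g*}"      # drop the trailing "-g<hash>" suffix
count="${rest##*-}"         # number of commits since the tag
tag="${rest%-*}"            # the tag itself, which may contain hyphens

printf 'tag=%s commits_since_tag=%s commit=%s\n' "$tag" "$count" "$commit"
```

Writing the unsplit string to the output tables keeps all three pieces together, and they can still be recovered later as shown.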

masked nucleotides showing up in vcf files, breaking merge script

The LoFreq variant caller seems to retain some of the lower-case masked nucleotides in the variants it writes to its .vcf files. These lower-case nucleotides end up getting converted to upper-case when running the .vcf through GATK VariantsToTable, but they are retained as lower-case when annotating the .vcf with ANNOVAR. This leads to errors when trying to merge the ANNOVAR annotation output with the .vcf tsv file from GATK, since the columns no longer match.

Consider forcing all nucleotides to upper case in the merge script. Or resolve this further upstream.
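
One way the "further upstream" option could look, as a sketch: upper-case the REF and ALT columns (columns 4 and 5 of a VCF record) before the file reaches GATK or ANNOVAR. The function name is hypothetical and the input is a toy record, not pipeline output.

```shell
#!/usr/bin/env bash
# Upper-case REF (column 4) and ALT (column 5) on VCF data lines;
# header lines starting with "#" pass through unchanged.
uppercase_alleles() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^#/ { print; next }
         { $4 = toupper($4); $5 = toupper($5); print }'
}

# Toy input: one header line and one record with a lower-case REF allele.
fixed="$(printf '#CHROM\tPOS\tID\tREF\tALT\nchr1\t100\t.\ta\tG\n' | uppercase_alleles)"
printf '%s\n' "$fixed"
```

This also upper-cases any other text in the ALT column, so it would need checking against whatever ALT values the callers actually emit.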

Need to adjust memory allocation for jobs

Default memory allocation for many SLURM jobs on NYU Big Purple HPC was recently reduced to 4GB. However, Nextflow reports suggest that for many jobs, this could be reduced further, which might help to increase job throughput.

[Screenshot: Screen Shot 2019-10-21 at 1 21 25 PM]

[Screenshot: Screen Shot 2019-10-21 at 1 21 05 PM]

Need to run some more evaluations of total memory usage for jobs and try to find parameters closer to memory minimum required for most jobs using typical datasets.

Add CNVKit targets annotation file creation method

Need to add an easy scripted method to create the refFlat targets annotation file for a new targets.bed file. Consider using a separate Nextflow pipeline since we will need to incorporate Singularity container to get CNVKit loaded

Need to refactor pairs channels to include 'callerType'

Currently, LoFreqSomatic implements a "callerType" value, whereas MuTect2 implements "chunkLabel". LoFreqSomatic's "callerType" ends up getting passed in to the "chunkLabel" channel variable. This causes problems where it is not possible to differentiate between MuTect2 chunk labels and LoFreqSomatic caller types downstream, including the latter being absent from the final annotation table.

Need to refactor the output channels for paired steps to include both "callerType" and "chunkLabel", and refactor all downstream processes that use them with this new cardinality.

Need output validator

Need to develop a validation method for the pipeline output, in order to determine if all required outputs are present. Consider developing this in Python, possibly as a unittest module, to be run upon pipeline completion.
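
The issue suggests Python; the sketch below uses shell only to show the shape of such a check, namely a list of required paths tested against the output directory. The file names are hypothetical placeholders, not the pipeline's actual outputs.

```shell
#!/usr/bin/env bash
# Print each required file that is missing or empty under the output directory.
missing_outputs() {
    local outdir="$1"
    shift
    local f
    for f in "$@"; do
        [ -s "${outdir}/${f}" ] || printf '%s\n' "$f"
    done
}

# Demo in a temporary directory with hypothetical file names.
outdir="$(mktemp -d)"
printf 'x\n' > "${outdir}/annotations.tsv"
missing="$(missing_outputs "$outdir" annotations.tsv report.html)"

if [ -n "$missing" ]; then
    echo "missing outputs: ${missing}"
else
    echo "all required outputs present"
fi
rm -rf "$outdir"
```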

Need to include `finalize-work-rm` in with default `run` Makefile recipe

It is not useful to retain Nextflow work subdirs from old pipeline runs that are no longer being used, so the make finalize-work-rm recipe should be run by default upon the successful completion of a pipeline under either the make run or Big Purple specific recipes. This will greatly reduce the amount of space used without compromising the 'resume' functionality. However, this recipe can take a long time to complete and produces a lot of stdout messages, so its implementation needs to avoid bloating the tail of the run log, since that is the primary way to easily tell whether the pipeline succeeded.

R installed in home dir breaks Singularity containers

We recently had errors with the deconstructSigs Singularity container deconstructSigs-1.8.0.simg:

Loading required package: GenomicRanges
Error: package or namespace load failed for ‘GenomicRanges’ in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]):
 namespace ‘XVector’ 0.18.0 is being loaded, but >= 0.19.8 is required
Error: package ‘GenomicRanges’ could not be loaded
Execution halted

It turns out that R was discovering two installations of the XVector library:

> installed.packages()[c(41, 117),]
        Package   LibPath
XVector "XVector" "/gpfs/home/vasudv02/R/x86_64-pc-linux-gnu-library/3.4"
XVector "XVector" "/conda/lib/R/library"
        Version  Priority
XVector "0.18.0" NA
XVector "0.20.0" NA
        Depends
XVector "R (>= 2.8.0), methods, BiocGenerics (>= 0.19.2), S4Vectors (>=\n0.15.14), IRanges (>= 2.9.18)"
XVector "R (>= 2.8.0), methods, BiocGenerics (>= 0.19.2), S4Vectors (>=\n0.17.24), IRanges (>= 2.13.16)"
        Imports
XVector "methods, zlibbioc, BiocGenerics, S4Vectors, IRanges"
XVector "methods, utils, zlibbioc, BiocGenerics, S4Vectors, IRanges"
        LinkingTo            Suggests                              Enhances
XVector "S4Vectors, IRanges" "Biostrings, drosophila2probe, RUnit" NA
XVector "S4Vectors, IRanges" "Biostrings, drosophila2probe, RUnit" NA
        License        License_is_FOSS License_restricts_use OS_type MD5sum
XVector "Artistic-2.0" NA              NA                    NA      NA
XVector "Artistic-2.0" NA              NA                    NA      NA
        NeedsCompilation Built
XVector "yes"            "3.4.2"
XVector "yes"            "3.4.1"

The version 0.18.0 of the library was in the user home directory /gpfs/home/vasudv02/R/x86_64-pc-linux-gnu-library/3.4. This conflicted with the conda-installed version 0.20.0 that was meant to be used inside the container, causing the error shown.

The solution in this case was to remove the user installation directory for R. However, it is expected that any user with an R installation that happens to coincide with the version of R inside the container might end up with the same problem. We need to figure out a way to prevent R libraries installed in the user home directory from being used inside Singularity containers. Note that by default, Singularity mounts the user's home directory inside the container, and by default R will look for libraries in the user home directory. So we might need to figure out whether it is possible to disable this feature or find some other work-around.

filter for fastq with no reads

Need to add some kind of channel filter or other sanity check to make sure that the fastq files have >0 reads in them.
Had a case where I 'finalize'd the output and then resumed the run; the empty fastq from 'fastq-merge' was used for the pipeline, which immediately broke. Need to prevent such situations.
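
A minimal sketch of such a sanity check, assuming it is acceptable to decompress the first few lines of each file. A gzip-compressed file with no content is still a few bytes long, so a plain file-size test is not enough for .gz inputs. The function name and the demo files are hypothetical.

```shell
#!/usr/bin/env bash
# Succeed only if the file holds at least one complete 4-line FASTQ record.
fastq_has_reads() {
    local fq="$1" n
    case "$fq" in
        *.gz) n="$(gzip -cd "$fq" | head -n 4 | wc -l)" ;;
        *)    n="$(head -n 4 "$fq" | wc -l)" ;;
    esac
    [ "$((n))" -ge 4 ]
}

# Demo on two temporary files: one read, and a gzip stream with no content.
tmp="$(mktemp -d)"
printf '@read1\nACGT\n+\nFFFF\n' | gzip -c > "${tmp}/ok.fastq.gz"
: | gzip -c > "${tmp}/empty.fastq.gz"

if fastq_has_reads "${tmp}/ok.fastq.gz"; then ok_result="has reads"; else ok_result="empty"; fi
if fastq_has_reads "${tmp}/empty.fastq.gz"; then empty_result="has reads"; else empty_result="empty"; fi

echo "ok.fastq.gz: ${ok_result}"
echo "empty.fastq.gz: ${empty_result}"
rm -rf "$tmp"
```

A check like this could run before the samples enter the pipeline, or its logic could be mirrored in a channel filter.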

resume not working with RealignerTargetCreator step

When trying to resume a completed pipeline, variant calling keeps getting started over again unnecessarily, starting at the RealignerTargetCreator step, followed by the IndelRealigner step. Need to figure out why these steps are not resuming properly; it is causing large delays in the re-processing and updating of old runs.

Errors in `reformat-vcf-table.py` return exit code 0

When an error is encountered in the reformat-vcf-table.py script while its stdout is being piped, the overall exit code of the process is returned as 0, allowing broken files to pass through the Nextflow pipelines. Example:

reformat-vcf-table.py -c MuTect2 -s "${tumorID}" -i "${tsv_file}" | \
        paste-col.py --header "Sample" -v "${tumorID}"  > output.tsv

If an error is encountered by reformat-vcf-table.py here while it is in the middle of processing the file contents, a partial file will be created in output.tsv and the returned exit code of 0 will allow Nextflow to propagate output.tsv further down the pipeline.

Need to figure out how to ensure that non-zero exit code is returned in these cases.
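
One candidate fix is bash's pipefail option, which makes a pipeline return the status of the rightmost command that failed instead of the status of the last command. The sketch below demonstrates the difference with two stand-in functions; they are hypothetical replacements for reformat-vcf-table.py and paste-col.py, not the real scripts.

```shell
#!/usr/bin/env bash
# Stand-in for reformat-vcf-table.py: emits one row, then fails.
failing_producer() {
    printf 'CHROM\tPOS\n'
    return 1
}

# Stand-in for paste-col.py: passes its input through and succeeds.
passthrough() {
    cat
}

# Without pipefail, the pipeline reports the status of the last command only.
set +o pipefail
status_default=0
failing_producer | passthrough > /dev/null || status_default=$?

# With pipefail, a failure anywhere in the pipeline becomes the pipeline status.
set -o pipefail
status_pipefail=0
failing_producer | passthrough > /dev/null || status_pipefail=$?
set +o pipefail

echo "default: ${status_default}, pipefail: ${status_pipefail}"
# prints "default: 0, pipefail: 1"
```

In the pipeline this could mean adding set -o pipefail at the top of the affected process scripts. Nextflow's shell process directive may also allow setting it for all tasks, which would need checking against the Nextflow version in use.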
