mskcc / tempo

CCS research pipeline to process WES and WGS TN pairs

Home Page: https://cmotempo.netlify.com/

Dockerfile 8.83% Groovy 54.36% Nextflow 1.52% Python 16.17% R 13.52% CSS 5.60%

tempo's Introduction

TEMPO

Tempo is a CMO Computational Sciences (CCS) research pipeline processing WES & WGS tumor-normal pairs using the Nextflow framework. Currently the pipeline is composed of alignment and QC, and detection of both somatic alterations and germline variants. Users can begin with inputs of either paired-end FASTQs or BAMs, and process these via the command line.

For further details of how to begin processing data with Tempo, please view our documentation. For contributing to this project, please make a pull request as detailed here.

The inspiration for this project derives from Sarek, developed at SciLifeLab.

Pipeline Flowchart

Directed Acyclic Graph

tempo's Issues

NF PR in order to fix memory issue

nextflow-io/nextflow#1035

Recall that Philip found (and mentioned on gitter) that Nextflow was assuming that LSF is configured to fundamentally use MB as the unit for job submissions.

This is wrong. (A) Users set this to whatever they want and (B) on site for juno, we use GB.

The following should fix this issue.

Performance Enhancement for BAM preprocessing

Replace samtools view/sort with sambamba view/sort to improve performance.

implement mutational signatures in NF

Docker issue here: #58

Add somatic.nf tests for Travis-CI

Travis-CI doesn't actually test somatic.nf.

This is fairly low-to-med priority

Discuss germline.nf

Targeting week of milestone 0.7.0 to have meeting

Implement Mark Duplicates before and after the merging of BAMs

The rationale here is that MD uses RG per BAM

DockerTimeout on AWS Batch

This issue sometimes occurs because volume I/O utilization is close to 100% while files are being downloaded and uploaded.
Error on AWS Batch:
DockerTimeoutError: Could not transition to created; timed out after waiting 4m0s
In this case the whole Nextflow job fails.
Another manifestation of the same issue:
CannotInspectContainerError: Could not transition to inspecting; timed out after waiting 30s
In this case only some jobs are reported as failed on AWS Batch although they succeeded.

Link to nextflow issue.

filtermutect2calls

Add filtermutect2calls merge in somatic.nf pipeline.

Finalize main_align_markDups_BaseRecal.nf to output (merged) BAMs with QC

Implement bam merge (check https://github.com/SciLifeLab/Sarek)
Implement QC step

Something to think about for germline, maybe: https://github.com/Illumina/canvas

Update cmopipeline containers to work on AWS

Update the base Docker images with ones that support running Miniconda (frolvlad/alpine-glibc:alpine-3.9, or migrate to Ubuntu).

Define sample sheet and pairing design

Should this be generated at runtime from user input?

Necessary information

Tumor sample name and associated FASTQ pairs (and their readgroup name, optional)
Normal sample name and associated FASTQ pairs (and their readgroup name, optional)
Tumor-normal pairing
Assay (WGS, WES, if WES, which platform)
Genome build (GRCh37 by default)
Whether to do somatic variant calling (optional, run by default)
Whether to do germline variant calling (optional, not run by default)
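The pairing part of the list above could be captured roughly as follows. This is only a sketch: the tab-separated layout and the column names (`tumor`, `normal`, `assay`, `target`, `genome`) are hypothetical, since the sample sheet format is still open, and FASTQ paths and read group names are left out.

```python
import csv
import io
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Pair:
    # All field names are hypothetical placeholders for the undecided format.
    tumor: str
    normal: str
    assay: str                # "WGS" or "WES"
    target: Optional[str]     # capture platform, required for WES only
    genome: str = "GRCh37"    # default build
    somatic: bool = True      # somatic calling runs by default
    germline: bool = False    # germline calling does not run by default


def parse_pairing(text: str) -> List[Pair]:
    """Parse a tab-separated pairing sheet into Pair records."""
    pairs = []
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        assay = row["assay"].upper()
        if assay not in {"WGS", "WES"}:
            raise ValueError(f"unknown assay: {row['assay']}")
        target = row.get("target") or None
        if assay == "WES" and target is None:
            raise ValueError(f"WES pair {row['tumor']} needs a capture platform")
        pairs.append(Pair(
            tumor=row["tumor"],
            normal=row["normal"],
            assay=assay,
            target=target,
            genome=row.get("genome") or "GRCh37",
        ))
    return pairs
```

Validating at parse time means a missing capture platform is reported before any alignment job is submitted.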

Remove makeSamplesFile process?

https://github.com/mskcc/vaporware/blob/ecaaf3c5c049c49386ccdcec676a2192bafe548c/somatic.nf#L87

Could we remove this by just adding the echo call to the dellyCall process, i.e.:
delly call ... && echo "${idTumor}\ttumor\n${idNormal}\tcontrol" > samples.tsv?
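Whichever process ends up writing the file, its content should be two tab-separated lines. A minimal sketch of the expected content, handy for checking that the `\t` and `\n` escapes were actually expanded (the function name is hypothetical):

```python
def delly_samples_tsv(id_tumor: str, id_normal: str) -> str:
    """Return the expected two-line, tab-separated content of samples.tsv."""
    return f"{id_tumor}\ttumor\n{id_normal}\tcontrol\n"
```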

Implement interval file usage

There are two scenarios currently, as I see it, that decide what interval files are used.

Exome sequencing

We don't need to consider the target capture kit to make a .bam file, but for some of the variant calling steps we do. We should put the target capture .bed files for Agilent and IDT exomes among our reference files. These are/have been used a lot in IGO recently and are available on our cluster:
/ifs/depot/pi/resources/roslin_resources/targets/AgilentExon_51MB_b37_v3
and /ifs/depot/pi/resources/roslin_resources/targets/IDT_Exome_v1_FP.
Some files might exist with a 5-bp padding on each side of the target intervals, which I think is good practice. So we should make sure that's true for all exome capture BEDs we use, whether we do that at run time or not.
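If the padding were applied at run time, it could look something like the sketch below. It assumes plain tab-separated BED lines, does not clamp the end coordinate to the contig length, and does not merge intervals that overlap after padding; `pad_bed` is a hypothetical helper name.

```python
def pad_bed(lines, padding=5):
    """Yield BED lines with start/end widened by `padding` bp on each side."""
    for line in lines:
        # Pass header, comment, and blank lines through unchanged.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            yield line.rstrip("\n")
            continue
        fields = line.rstrip("\n").split("\t")
        start = max(0, int(fields[1]) - padding)  # BED starts are 0-based
        end = int(fields[2]) + padding            # not clamped to contig length
        yield "\t".join([fields[0], str(start), str(end), *fields[3:]])
```

For example, `list(pad_bed(["1\t100\t200\n"]))` gives `["1\t95\t205"]`.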

Genome sequencing

We make a .bam as above, but for some of the variant calling steps we limit it to "callable" regions of the genome. Broad has made these available in their bundle. Some SV callers provide their own files of this type, and we should probably evaluate the differences between these at some point.

Note that I think we ought to focus on at least maintaining everything in GRCh37, but most files that exist on GRCh37 also exist on GRCh38. Are we using smallGRCh37 in references.config or can that be removed?

BAMQC at the RG and sample level

Change Alfred so that it makes both types of bam QCs

Implement SV caller, NovoBreak

We'll need the code:

https://github.com/czc/nb_distribution

And I don't think a Docker image exists, so it will need to be created

bcftools norm

Add bcftools norm in somatic.nf pipeline.

Implement interval bed for mutect2

Mutect2 is slow.

Let's implement interval bed for mutect2 and see if we can get it going a bit faster.

Use Sarek's pipeline as guidance.

Continuous Integration testing

CI should be able to run short pipeline tests (with small files) and the extension test suite. We can use Travis or Jenkins for this.

Document provenance of reference files

The source of all reference files needs to be documented. Any manipulations to reference files need to be documented as well.

Make more informative travis-CI tests

Lumpyexpress

Add Lumpyexpress in somatic.nf pipeline.

Choose SV callers

Re: #63 and #64.

Per discussion in today's meeting we should pick two or a maximum of three SV callers. We need to decide on what basis we make this choice. I'm imagining we'll do Manta + Delly + one more, but I don't have a strong opinion on this.

-B option for bwa mem

Do we need to give a different mismatch penalty for Tumor and Normal samples? @kpjonsson @evanbiederstedt
https://github.com/mskcc/Sarek/blob/6a404605601f21bb748bba47c36993168349d060/main.nf#L170

Merge BAMs performance optimization

The current implementation waits for all SortBAMs to complete before grouping BAMs for merge. We should probably optimize this by grouping them at the FASTQ level and not waiting for other Sort steps where it is not necessary.
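The idea can be sketched outside Nextflow as follows: if the number of FASTQ pairs per sample is known up front (for instance from the sample sheet), a sample's BAMs can be released for merging as soon as its last sorted BAM arrives. The function and argument names are hypothetical; inside Nextflow, `groupKey` combined with `groupTuple` may be the analogous mechanism.

```python
from collections import defaultdict


def group_when_complete(sorted_bams, lanes_per_sample):
    """Yield (sample, bams) as soon as all lanes of a sample have arrived.

    sorted_bams: iterable of (sample, bam_path) in completion order.
    lanes_per_sample: dict mapping sample -> expected number of FASTQ pairs.
    """
    pending = defaultdict(list)
    for sample, bam in sorted_bams:
        pending[sample].append(bam)
        # Release this sample without waiting for any other sample's sorts.
        if len(pending[sample]) == lanes_per_sample[sample]:
            yield sample, pending.pop(sample)
```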

Add QC for fastq and bam files in make_bam_and_qc.nf

bcftools merge

Add bcftools merge in somatic.nf pipeline.

Implement a function which will split the FASTQ input if the FASTQ input is greater than a certain size.
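A minimal sketch of the splitting step, assuming standard four-line FASTQ records (no wrapped sequence lines). The chunk size `reads_per_chunk` is a hypothetical parameter; the file-size threshold that triggers splitting is left to the caller.

```python
import itertools


def split_fastq(lines, reads_per_chunk):
    """Yield lists of FASTQ lines, each holding at most reads_per_chunk records."""
    it = iter(lines)
    while True:
        # Each record is 4 lines: header, sequence, '+', qualities.
        chunk = list(itertools.islice(it, 4 * reads_per_chunk))
        if not chunk:
            return
        if len(chunk) % 4:
            raise ValueError("truncated FASTQ record")
        yield chunk
```

For paired-end data, both files of a pair would need the same `reads_per_chunk` so that mates stay in matching chunks.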

MarkDup .bai file

https://github.com/mskcc/vaporware/blob/43e7dc2ca70fb0b6c0df03969e4c2878ca5cc45b/make_bam_and_qc.nf#L176
https://github.com/mskcc/vaporware/blob/43e7dc2ca70fb0b6c0df03969e4c2878ca5cc45b/make_bam_and_qc.nf#L177

The .bai file needs to be removed in the next release.

Autoscale EBS for Amazon AMI

Setup autoscale EBS scripts on Amazon AMI.

Simplify container configs

Simplify containers.config and containers_lsf.config.

Perhaps create a resources config that defines how much to allocate per resource, which may be specific to the compute environment?

Fix RG assignment in `AlignReads`

Let's decide how we want to parse input sample sheets and feed a read group into BWA mem. It's currently hardcoded. This might depend on what input sample sheet format we land on.
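As a starting point for that discussion, the read group could be derived from sample sheet fields instead of being hardcoded. The sketch below assumes each FASTQ pair carries a sample name and a lane identifier; the choice of fields (ID, SM, LB, PL, PU) and how they are filled is an assumption, not the current behaviour.

```python
def read_group(sample, lane, library=None, platform="ILLUMINA"):
    """Build the @RG header line passed to `bwa mem -R`."""
    fields = {
        "ID": f"{sample}.{lane}",   # unique per sample and lane
        "SM": sample,
        "LB": library or sample,    # fall back to the sample name
        "PL": platform,
        "PU": lane,
    }
    # bwa expects the literal two-character sequence "\t" between fields.
    return "@RG\\t" + "\\t".join(f"{k}:{v}" for k, v in fields.items())
```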

Re-configure config files/infrastructure. Purge Sarek legacy code?

Mutational Signatures: create a Docker image for this, and push to Dockerhub

Create AWS setup environment script

Provide an easy way to create the AWS Batch infrastructure for deployment, probably using CloudFormation, or Python + aws-cli. Update the README accordingly.

MSI sensor

Add MSI sensor in somatic.nf pipeline.

Validate somatic.nf results on real data

Define WES and WGS example files to run through the pipeline scripts after every release

Create a set of WES test files and cases to run against the develop branch

WGS will probably be kicked off manually for now

Implement SV caller, GRIDSS

https://github.com/PapenfussLab/gridss

Dockerfile: https://github.com/PapenfussLab/gridss/blob/master/docker/Dockerfile

Nextflow ToDo

Separate genomes in config
Add AWSbatch profile

Memory reservation problems between Nextflow and JUNO LSF Cluster settings

As noted by @kpjonsson in nextflow-io/nextflow#1071

The way Nextflow submits jobs to JUNO is causing issues, because of a mismatch between how JUNO's LSF is configured and how Nextflow submits jobs via bsub.

The problem comes from how Nextflow handles memory assignment through the bsub command and how it only does it per slot through bsub's -M flag instead of both -M and -R; both parameters are required in our JUNO LSF configuration.
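To make the requirement concrete, here is a sketch of the bsub arguments JUNO appears to need: the same per-slot value passed both as the limit (`-M`) and as the reservation (`-R "rusage[mem=...]"`). It assumes the cluster counts memory in GB and per slot; `bsub_memory_args` is a hypothetical helper, not the code in the proposed Nextflow patch.

```python
def bsub_memory_args(total_mem_gb, cpus, per_slot=True):
    """Return bsub arguments that set both the limit (-M) and the reservation (-R)."""
    # Assumption: the cluster expects GB, and per-slot values when per_slot is True.
    mem = total_mem_gb // cpus if per_slot else total_mem_gb
    mem = max(1, mem)  # never request zero
    return ["-M", str(mem), "-R", f"rusage[mem={mem}]"]
```

For a 16 GB, 4-CPU task this yields `["-M", "4", "-R", "rusage[mem=4]"]`.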

See the proposed fix from @gongyixiao, which DOES solve it for our purposes but has met some resistance from Nextflow with regard to adoption: mskcc/nextflow@442ce07

We should discuss how to proceed - whether we're fine with the workaround above in the event that Nextflow decides not to incorporate our changes (and all the ramifications of such a decision); whether we should push to get it added to the main Nextflow code; or if we should find an alternative solution.

Caused by:
  Failed to pull singularity image
  command: singularity pull --name cmopipeline-htstools-0.1.1.img docker://cmopipeline/htstools:0.1.1 > /dev/null
  status : 255
  message:
    WARNING: Authentication token file not found : Only pulls of public images will succeed
    INFO:    Starting build...
    Getting image source signatures
    Skipping fetch of repeat blob sha256:e53f134edff2c9a6928199bfbd8d0e70c1ecfcb4b5b70462028062f567a528f7
    Skipping fetch of repeat blob sha256:efbbd466a715ba1ee85664ed1e1fe53c3cb54759225eef1869a9b27179ea675f
    Skipping fetch of repeat blob sha256:e11368b8e0c73f08ef1deb948c24a8cfd2307a8eb138a0caf77bdfe4a4722d99
    Skipping fetch of repeat blob sha256:7dab2de7692bef415de0b332748c99d8949a7768add945030191c72a42e80511
    Skipping fetch of repeat blob sha256:c061951b6186beca8ad002e49d0066f90c340545bf9e3b34195edcbfaec618f8
    Copying config sha256:8b0b9fc266d2f7cdb4fb3cf5e00b734ff50a3f999d894738b6f76826610c7b21
     0 B / 5.25 KiB [--------------------------------------------------------------]
     5.25 KiB / 5.25 KiB [======================================================] 0s
    Writing manifest to image destination
    Storing signatures
    FATAL:   Unable to pull docker://cmopipeline/htstools:0.1.1: conveyor failed to get: no descriptor found for reference "a8a2bf6dabba1849d41d3598bd69dc96e9f93a0029cfcc0259f4bf7f2f8e3814"

"intervals" setting in reference file needs to be different for WGS and WES, affecting BQSR

In the ApplyBQSR step, intervals need to use wgs_calling_regions_CAW.list for WGS and Broad.human.exome.b37.interval_list for WES. @kpjonsson @evanbiederstedt can confirm this.

We might need a pipeline parameter to control this instead of hard-coding it.
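The parameter could boil down to a lookup keyed on assay, sketched here with the two file names mentioned above. The `assay` argument and the function name are hypothetical, and the mapping itself is still pending confirmation.

```python
# Mapping proposed in this issue; pending confirmation.
INTERVALS = {
    "WGS": "wgs_calling_regions_CAW.list",
    "WES": "Broad.human.exome.b37.interval_list",
}


def intervals_for(assay):
    """Return the intervals file to use for the given assay."""
    try:
        return INTERVALS[assay.upper()]
    except KeyError:
        raise ValueError(f"no intervals defined for assay: {assay}") from None
```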