tempo's Introduction

TEMPO

Tempo is a CMO Computational Sciences (CCS) research pipeline processing WES & WGS tumor-normal pairs using the Nextflow framework. Currently the pipeline is composed of alignment and QC, and detection of both somatic alterations and germline variants. Users can begin with inputs of either paired-end FASTQs or BAMs, and process these via the command line.

For further details of how to begin processing data with Tempo, please view our documentation. For contributing to this project, please make a pull request as detailed here.

The inspiration for this project derives from Sarek, developed at SciLifeLab.

Pipeline Flowchart

Directed Acyclic Graph

tempo's People

Contributors

evanbiederstedt, gongyixiao, allanbolipata, kpjonsson, sivkovic, anoronh4, junwoo2, nikhil, hweej

tempo's Issues

NF PR in order to fix memory issue

nextflow-io/nextflow#1035

Recall that Philip found (and mentioned on gitter) that Nextflow was assuming that LSF is configured to fundamentally use MB as the unit for job submissions.

This is wrong. (A) Users set this to whatever they want and (B) on site for juno, we use GB.
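
As an illustration only (this is not the Nextflow patch), the unit mismatch can be sketched in Python. The sketch assumes the cluster declares its unit through the LSF_UNIT_FOR_LIMITS setting in lsf.conf; the helper names and the fallback unit are hypothetical, and the fallback should be checked against the local LSF installation.

```python
# Hypothetical helpers: express a memory request in the unit the LSF
# cluster is configured to use, instead of assuming a fixed unit.
UNIT_BYTES = {"KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}


def read_lsf_unit(lsf_conf_text, default="KB"):
    """Return the LSF_UNIT_FOR_LIMITS value found in lsf.conf text.

    The fallback used when the key is absent is an assumption.
    """
    for line in lsf_conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if line.startswith("LSF_UNIT_FOR_LIMITS"):
            _, _, value = line.partition("=")
            value = value.strip().strip('"').upper()
            if value in UNIT_BYTES:
                return value
    return default


def memory_for_lsf(mem_bytes, unit):
    """Convert bytes to whole units, rounding up so a job is never under-provisioned."""
    size = UNIT_BYTES[unit]
    return -(-mem_bytes // size)  # ceiling division


unit = read_lsf_unit("LSF_UNIT_FOR_LIMITS=GB\n")
print(unit, memory_for_lsf(8 * 1024**3, unit))
```

With a GB-configured cluster such as juno, an 8 GiB request becomes the value 8 rather than 8192, which is the difference described above.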

The Nextflow pull request referenced above (nextflow-io/nextflow#1035) should fix this issue.

DockerTimeout on AWS Batch

This issue occurs intermittently when volume I/O utilization approaches 100% while files are being downloaded and uploaded.
Error on AWS Batch:
DockerTimeoutError: Could not transition to created; timed out after waiting 4m0s
In this case the whole Nextflow job fails.
Another manifestation of the same issue:
CannotInspectContainerError: Could not transition to inspecting; timed out after waiting 30s
In this case only some jobs are reported as failed on AWS Batch, although they actually succeeded.

Link to nextflow issue.

Define sample sheet and pairing design

Should this be generated at runtime from user input?

Necessary information

  • Tumor sample name and associated FASTQ pairs (and their readgroup name, optional)
  • Normal sample name and associated FASTQ pairs (and their readgroup name, optional)
  • Tumor-normal pairing
  • Assay (WGS, WES, if WES, which platform)
  • Genome build (GRCh37 by default)
  • Whether to do somatic variant calling (optional, run by default)
  • Whether to do germline variant calling (optional, not run by default)
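
One way the pairing information could be captured is a single tab-separated sheet with one row per FASTQ pair. The sketch below is only a proposal for discussion: the column names (pair, sample, role, assay, fastq_1, fastq_2) are hypothetical, and the optional fields (read group, genome build, somatic/germline switches) are left out for brevity.

```python
import csv
import io
from collections import defaultdict

# Hypothetical column names; none of these are fixed by the pipeline yet.
REQUIRED_COLUMNS = ("pair", "sample", "role", "assay", "fastq_1", "fastq_2")


def parse_sample_sheet(text):
    """Group FASTQ rows by tumor-normal pair and check that each pair is complete."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    pairs = defaultdict(lambda: {"tumor": [], "normal": []})
    for row in reader:
        role = row["role"].lower()
        if role not in ("tumor", "normal"):
            raise ValueError(f"unknown role {row['role']!r} for sample {row['sample']}")
        pairs[row["pair"]][role].append(row)

    # Every pair needs at least one tumor and one normal FASTQ pair.
    for pair_id, members in pairs.items():
        for role in ("tumor", "normal"):
            if not members[role]:
                raise ValueError(f"pair {pair_id} has no {role} sample")
    return dict(pairs)
```

A sample sequenced on several lanes simply appears on several rows with the same sample name, so the same sheet could also drive the read-group assignment.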

Implement interval file usage

There are two scenarios currently, as I see it, that decide what interval files are used.

Exome sequencing

We don't need to consider the target capture kit to make a .bam file, but for some of the variant calling steps we do. We should put the target capture .bed files for Agilent and IDT exomes among our reference files. These are/have been used a lot in IGO recently and are available on our cluster:
/ifs/depot/pi/resources/roslin_resources/targets/AgilentExon_51MB_b37_v3
and /ifs/depot/pi/resources/roslin_resources/targets/IDT_Exome_v1_FP.
Some files might exist with a 5-bp padding on each side of the target intervals, which I think is good practice. So we should make sure that's true for all exome capture BED files we use, whether we do that at run time or not.
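
If padding were applied at run time, it could look roughly like the Python sketch below. It assumes standard BED coordinates (0-based start, half-open end), merges intervals that overlap or touch after padding, and does not clamp to chromosome ends because no chromosome lengths are passed in. The function name is hypothetical.

```python
def pad_bed_intervals(intervals, padding=5):
    """Pad (chrom, start, end) BED intervals and merge any that overlap or touch afterwards."""
    padded = sorted(
        (chrom, max(0, start - padding), end + padding)  # never go below position 0
        for chrom, start, end in intervals
    )
    merged = []
    for chrom, start, end in padded:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Extend the previous interval instead of emitting an overlapping one.
            merged[-1] = (chrom, merged[-1][1], max(merged[-1][2], end))
        else:
            merged.append((chrom, start, end))
    return merged


targets = [("1", 100, 200), ("1", 205, 300), ("2", 3, 10)]
print(pad_bed_intervals(targets))
```

In practice an existing tool would probably do this job; the sketch only shows what "5-bp padding on each side" means for the coordinates.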

Genome sequencing

We make a .bam as above, but for some of the variant calling steps we limit it to "callable" regions of the genome. Broad has made these available in their bundle. Some SV callers provide their own files of this type, and we should probably evaluate the differences between these at some point.

Note that I think we ought to focus on at least maintaining everything in GRCh37, but most files that exist on GRCh37 also exist on GRCh38. Are we using smallGRCh37 in references.config or can that be removed?

Implement interval bed for mutect2

Mutect2 is slow.

Let's implement interval bed for mutect2 and see if we can get it going a bit faster.

Use Sarek's pipeline as guidance.
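
The general idea is to split the intervals into groups of similar total size so that Mutect2 can run on the groups in parallel. The Python sketch below uses a simple greedy balancing scheme; it is an assumption about how the split could be done, not a description of what Sarek does, and the function name is hypothetical.

```python
import heapq


def scatter_intervals(intervals, n_groups):
    """Distribute (chrom, start, end) intervals over n_groups with similar total length.

    Greedy scheme: place the longest remaining interval into the group
    that currently holds the fewest bases.
    """
    groups = [[] for _ in range(n_groups)]
    heap = [(0, idx) for idx in range(n_groups)]  # (total bases, group index)
    by_length = sorted(intervals, key=lambda iv: iv[2] - iv[1], reverse=True)
    for interval in by_length:
        total, idx = heapq.heappop(heap)
        groups[idx].append(interval)
        heapq.heappush(heap, (total + interval[2] - interval[1], idx))
    return groups


print(scatter_intervals([("1", 0, 100), ("2", 0, 60), ("3", 0, 40)], 2))
```

Each group would then be written to its own BED file and passed to a separate Mutect2 job, with the per-group calls combined afterwards.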

Continuous Integration testing

CI should be able to run short pipeline tests (with small files) and the extension test suite. We can use Travis or Jenkins for this.

Choose SV callers

Re: #63 and #64.

Per discussion in today's meeting we should pick two or a maximum of three SV callers. We need to decide on what basis to make this choice. I'm imagining we'll do Manta + Delly + one more, but I don't have a strong opinion on this.

Merge BAMs performance optimization

The current implementation waits for all SortBAMs tasks to complete before grouping BAMs for merging. We should probably optimize this by grouping them at the FASTQ level, so that a sample's merge does not wait for Sort steps it does not depend on.
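
The intended behaviour can be sketched in Python as follows: a sample's BAMs are released for merging as soon as the expected number has arrived, without waiting for other samples. The function name and the expected_counts mapping (sample name to number of FASTQ pairs, which would come from the sample sheet) are hypothetical. In Nextflow itself, the groupTuple operator with a known group size is probably the closest equivalent.

```python
from collections import defaultdict


def group_when_complete(sorted_bams, expected_counts):
    """Yield (sample, bams) as soon as every expected BAM for that sample has arrived.

    sorted_bams is a stream of (sample, bam_path) tuples in completion order.
    """
    pending = defaultdict(list)
    for sample, bam in sorted_bams:
        pending[sample].append(bam)
        if len(pending[sample]) == expected_counts[sample]:
            # This sample is complete: emit it without waiting for the others.
            yield sample, pending.pop(sample)


stream = [("A", "a1.bam"), ("B", "b1.bam"), ("A", "a2.bam"), ("B", "b2.bam")]
for sample, bams in group_when_complete(stream, {"A": 2, "B": 2}):
    print(sample, bams)
```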

Simplify container configs

Simplify containers.config and containers_lsf.config.

Perhaps create a resources config that defines how much of each resource to allocate, since this may be specific to the compute environment?

Fix RG assignment in `AlignReads`

Let's decide how we want to parse input sample sheets and feed a read group into BWA mem. It's currently hardcoded. This might depend on what input sample sheet format we land on.

Create AWS setup environment script

Provide an easy way to create the AWS Batch infrastructure for deployment, probably using CloudFormation or Python + aws-cli. Update the README accordingly.

MSI sensor

Add MSIsensor to the somatic.nf pipeline.

Memory reservation problems between Nextflow and JUNO LSF Cluster settings

As noted by @kpjonsson in nextflow-io/nextflow#1071

The way Nextflow submits jobs to JUNO is causing issues, because of a mismatch between how JUNO's LSF is configured and how Nextflow builds its bsub commands.

The problem is in how Nextflow assigns memory through the bsub command: it sets memory per slot only through bsub's -M flag, instead of through both -M and -R. Both parameters are required in our JUNO LSF configuration.
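
To make the requirement concrete, here is a minimal Python sketch of a submission that passes both flags. It is not the patch linked in this issue. The rusage[mem=...] string is standard LSF resource-requirement syntax, but whether the reservation is interpreted per slot or per job, and in which unit, depends on the cluster configuration; the function name and the reserve_per_slot parameter are hypothetical.

```python
def bsub_memory_args(mem_per_slot, slots, reserve_per_slot=True):
    """Build bsub arguments that set both the -M limit and the -R memory reservation.

    mem_per_slot is expressed in whatever unit the cluster's LSF uses.
    reserve_per_slot reflects an assumed cluster setting: if the cluster
    reserves memory per job instead, the reservation is scaled by the slot count.
    """
    reserved = mem_per_slot if reserve_per_slot else mem_per_slot * slots
    return [
        "-n", str(slots),
        "-M", str(mem_per_slot),
        "-R", f"rusage[mem={reserved}]",
    ]


print(" ".join(["bsub"] + bsub_memory_args(4, 2)))
```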

See the proposed fix from @gongyixiao, which DOES solve it for our purposes but has met some resistance from Nextflow with regard to adoption: mskcc/nextflow@442ce07

We should discuss how to proceed - whether we're fine with the workaround above in the event that Nextflow decides not to incorporate our changes (and all the ramifications of such a decision); whether we should push to get it added to the main Nextflow code; or if we should find an alternative solution.

singularity image pull fails with "no descriptor found for reference" error

If the singularity image hasn't already been downloaded to a directory set with NXF_SINGULARITY_CACHEDIR, sometimes it fails on pull.

Running the command outside of Nextflow succeeds by itself, however.

Caused by:
  Failed to pull singularity image
  command: singularity pull --name cmopipeline-htstools-0.1.1.img docker://cmopipeline/htstools:0.1.1 > /dev/null
  status : 255
  message:
    WARNING: Authentication token file not found : Only pulls of public images will succeed
    INFO:    Starting build...
    Getting image source signatures
    Skipping fetch of repeat blob sha256:e53f134edff2c9a6928199bfbd8d0e70c1ecfcb4b5b70462028062f567a528f7
    Skipping fetch of repeat blob sha256:efbbd466a715ba1ee85664ed1e1fe53c3cb54759225eef1869a9b27179ea675f
    Skipping fetch of repeat blob sha256:e11368b8e0c73f08ef1deb948c24a8cfd2307a8eb138a0caf77bdfe4a4722d99
    Skipping fetch of repeat blob sha256:7dab2de7692bef415de0b332748c99d8949a7768add945030191c72a42e80511
    Skipping fetch of repeat blob sha256:c061951b6186beca8ad002e49d0066f90c340545bf9e3b34195edcbfaec618f8
    Copying config sha256:8b0b9fc266d2f7cdb4fb3cf5e00b734ff50a3f999d894738b6f76826610c7b21
     0 B / 5.25 KiB [--------------------------------------------------------------]
     5.25 KiB / 5.25 KiB [======================================================] 0s
    Writing manifest to image destination
    Storing signatures
    FATAL:   Unable to pull docker://cmopipeline/htstools:0.1.1: conveyor failed to get: no descriptor found for reference "a8a2bf6dabba1849d41d3598bd69dc96e9f93a0029cfcc0259f4bf7f2f8e3814" 
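
Since the same command succeeds when run by hand, one possible workaround is to pull images into the cache directory before starting Nextflow, retrying on failure. The Python sketch below assumes the failure is transient, which has not been confirmed. The pull command and the image file name follow the log above; the helper names, the retry count and the delay are hypothetical.

```python
import subprocess
import time


def image_file_name(docker_image):
    """Mirror the file name seen in the log, e.g. cmopipeline-htstools-0.1.1.img."""
    return docker_image.replace("/", "-").replace(":", "-") + ".img"


def pull_command(docker_image):
    """Build the same singularity pull command that appears in the error report."""
    return ["singularity", "pull", "--name", image_file_name(docker_image),
            f"docker://{docker_image}"]


def pull_with_retry(docker_image, attempts=3, delay=5.0, run=subprocess.call):
    """Run the pull until it exits with status 0; return the attempt number that succeeded.

    run is injectable so the logic can be exercised without singularity installed.
    """
    cmd = pull_command(docker_image)
    for attempt in range(1, attempts + 1):
        if run(cmd) == 0:
            return attempt
        if attempt < attempts:
            time.sleep(delay)
    raise RuntimeError(f"pull failed after {attempts} attempts: {' '.join(cmd)}")
```

The script would be run from inside the directory that NXF_SINGULARITY_CACHEDIR points to, once per image used by the pipeline.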
