Broad Institute of MIT and Harvard Single-Cell/Nucleus Multiomic Processing Pipeline

Epigenomics Program pipeline to analyze SHARE-seq data.

License: MIT License

Pipeline specifications can be found here.

Pipeline main page on Dockstore.

Pipeline overview.

Structure of this repo

  • The tasks directory contains the tasks called from the main workflow share-seq.wdl. Each task corresponds to a different step of the pipeline: align, filter, etc.
  • The src directory contains bash, Python, R, and notebook scripts called within the tasks.
  • The dockerfiles directory contains the Dockerfiles used to build the pipeline's Docker images.
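
For orientation, the layout looks roughly like this (an illustrative sketch based on the description above; see the repository itself for the authoritative structure):

    epi-SHARE-seq-pipeline/
    ├── share-seq.wdl    # main workflow
    ├── tasks/           # one WDL task per pipeline step (align, filter, ...)
    ├── src/             # bash, Python, R, and notebook scripts called by the tasks
    └── dockerfiles/     # Dockerfiles for the pipeline's Docker images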

Introduction

The SHARE-seq multiomic pipeline is based on the original Buenrostro SHARE-seq pipeline specifications (by Sai Ma) in this GitHub repo.

This 10X single-cell multiomic pipeline is based on the ENCODE (phase-3) single-cell pipeline specifications (by Anshul Kundaje) in this Google Doc.

Features

  • Portability: The pipeline can be run on different cloud platforms such as Google Cloud, AWS, and DNAnexus, as well as on cluster engines such as SLURM, SGE, and PBS.
  • User-friendly HTML report: In addition to the standard outputs, the pipeline generates an HTML report with quality metrics, including alignment statistics, along with many useful plots. An example of the HTML report. # TODO: add an example html.
  • Supported genomes: The pipeline requires genome-specific data such as aligner indices, chromosome sizes, and blacklisted regions. We provide genome references for hg38, mm10, and mm39.

Installation

  1. Install Caper (Python Wrapper/CLI for Cromwell).

    $ pip install caper
  2. IMPORTANT: Read Caper's README carefully to choose a backend for your system, then follow the commented instructions in the configuration file (an illustrative SLURM sketch appears after these steps).

    # backend: local or your HPC type (e.g. slurm, sge, pbs, lsf). read Caper's README carefully.
    $ caper init [YOUR_BACKEND]
    
    # IMPORTANT: edit the conf file and follow commented instructions in there
    $ vi ~/.caper/default.conf
  3. Git clone this pipeline.

    $ cd
    $ git clone https://github.com/broadinstitute/epi-SHARE-seq-pipeline/ #TODO: This should point to the release
  4. Define test input JSON.

    INPUT_JSON="" #TODO: We need a test dataset available for everyone
  5. If you have Docker and want to run the pipeline locally on your laptop, --max-concurrent-tasks 1 limits the number of concurrent tasks for a test run. Omit it when running on a workstation/HPC.

    # check if Docker works on your machine
    $ docker run ubuntu:latest echo hello
    
    # --max-concurrent-tasks 1 is for computers with limited resources
    $ caper run share-seq.wdl -i "${INPUT_JSON}" --docker --max-concurrent-tasks 1
  6. Otherwise, install Singularity on your system. Please follow these instructions to install Singularity on a Debian-based OS. Or ask your system administrator to install Singularity on your HPC.

    # check if Singularity works on your machine
    $ singularity exec docker://ubuntu:latest echo hello
    
    # on your local machine (--max-concurrent-tasks 1 is for computers with limited resources)
    $ caper run share-seq.wdl -i "${INPUT_JSON}" --singularity --max-concurrent-tasks 1
    
    # on HPC, make sure that Caper's conf ~/.caper/default.conf is correctly configured to work with your HPC
    # the following command will submit Caper as a leader job to SLURM with Singularity
    $ caper hpc submit share-seq.wdl -i "${INPUT_JSON}" --singularity --leader-job-name ANY_GOOD_LEADER_JOB_NAME
    
    # check job ID and status of your leader jobs
    $ caper hpc list
    
    # cancel the leader job to terminate all of its child jobs
    # if you directly use a cluster command such as scancel or qdel,
    # the child jobs will not be terminated
    $ caper hpc abort [JOB_ID]
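
The conf file generated by caper init contains the authoritative, commented instructions. As an illustrative sketch only (key names may differ across Caper versions), a SLURM setup might look like:

    # ~/.caper/default.conf -- illustrative sketch for a SLURM cluster
    backend=slurm

    # partition/account to submit jobs to; the values here are placeholders
    slurm-partition=YOUR_PARTITION
    slurm-account=YOUR_ACCOUNT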

Input JSON file

IMPORTANT: DO NOT BLINDLY USE A TEMPLATE/EXAMPLE INPUT JSON. READ THROUGH THE FOLLOWING GUIDE TO MAKE A CORRECT INPUT JSON FILE.

An input JSON file specifies all of the input parameters and files that are necessary for successfully running this pipeline. This includes a specification of the path to the genome reference files and the raw data FASTQ files. Please make sure to specify absolute paths rather than relative paths in your input JSON files.

  1. Input JSON file specification (short)
  2. Input JSON file specification (long)
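
As a hypothetical minimal sketch only (key names are illustrative and may not match the current schema; the specification documents above are authoritative), an input JSON might look like:

{
    "ShareSeq.prefix": "my-experiment",
    "ShareSeq.genome_name_input": "hg38",
    "ShareSeq.idx_tar_atac": "/absolute/path/to/atac_index.tar.gz",
    "ShareSeq.idx_tar_rna": "/absolute/path/to/rna_index.tar.gz",
    "ShareSeq.read1_atac": ["/absolute/path/to/sample.atac.R1.fastq.gz"],
    "ShareSeq.read2_atac": ["/absolute/path/to/sample.atac.R2.fastq.gz"],
    "ShareSeq.read1_rna": ["/absolute/path/to/sample.rna.R1.fastq.gz"],
    "ShareSeq.read2_rna": ["/absolute/path/to/sample.rna.R2.fastq.gz"]
}

Note that the read inputs are JSON arrays: the workflow declares them as Array[File], so even a single FASTQ must be wrapped in a list.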

Running on Terra/AnVIL (using Dockstore)

Visit our pipeline repo on Dockstore. Click on Terra or AnVIL. Follow Terra's instructions to create a workspace on Terra and add Terra's billing bot to your Google Cloud account.

A test input JSON for Terra is not available at the moment; once one is published, download it, upload it to Terra's UI, and then run the analysis.

If you would like to use your own input JSON file, make sure that all files in the input JSON are on a Google Cloud Storage bucket (gs://). URLs will not work.
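
For example, local files can be staged to a bucket you control with gsutil (the bucket name below is a placeholder):

$ gsutil cp /absolute/path/to/sample.rna.R1.fastq.gz gs://YOUR_BUCKET/share-seq/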

How to organize outputs

Install Croo. Make sure that you have Python 3 (> 3.4.1) installed on your system. Find the metadata.json file in Caper's output directory.

$ pip install croo
$ croo [METADATA_JSON_FILE]
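
If you prefer to organize outputs into a specific directory, croo also accepts an output location (a sketch, assuming croo's --out-dir option; check croo --help for your installed version):

$ croo [METADATA_JSON_FILE] --out-dir /path/to/organized/output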

How to make a spreadsheet of QC metrics

Install qc2tsv. Make sure that you have Python 3 (>3.4.1) installed on your system.

Once you have organized the output with Croo, you will be able to find the pipeline's final output file qc/qc.json, which contains all the QC metrics. Simply feed qc2tsv multiple qc.json files; it can take various URIs such as local paths, gs://, and s3://.

$ pip install qc2tsv
$ qc2tsv /sample1/qc.json gs://sample2/qc.json s3://sample3/qc.json ... > spreadsheet.tsv

QC metrics for each experiment (qc.json) will be split into multiple rows (one for the overall experiment plus one for each biological replicate) in the spreadsheet.


TODO: Sambamba; add track generation.

Thank you to the ENCODE DAC for writing excellent documentation for their pipelines that we used as templates.

Contributors

emattei, kdong2395, lzj1769, mei-knudson, nchernia, sidwekhande


epi-share-seq-pipeline's Issues

pretty plate QC

We should incorporate Surya's code for the plate QC into demux.

no way to force exit jupyter nb execution via papermill

Cannot find a way to force exit jupyter notebook running IRkernel via papermill. This mainly affects the DORCs correlation function, which goes into an infinite loop if there are no common barcodes between RNA and ATAC matrices.

  • papermill ignores R's quit() and stop() commands
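
A possible stopgap, not a fix for the underlying IRkernel behavior, is papermill's per-cell timeout, which aborts any cell that runs longer than the given number of seconds (notebook names below are placeholders):

    # workaround sketch: fail the run instead of looping forever
    $ papermill --execution-timeout 600 dorcs_notebook.ipynb dorcs_notebook.out.ipynb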

Ambient RNA

Mei checked CellBender and it seems that it did not affect the results. Check with Jason to learn more about this.

WorkflowManagerActor: Workflow f5905e43-a5fe-423c-be74-3afbfc0d5540 failed

When running the pipeline via Dockstore, I got this error message:

[info] WorkflowManagerActor: Workflow f5905e43-a5fe-423c-be74-3afbfc0d5540 failed (during MaterializingWorkflowDescriptorState): cromwell.engine.workflow.lifecycle.materialization.MaterializeWorkflowDescriptorActor$$anon$1: Workflow input processing failed:
Failed to evaluate input 'read2_atac' (reason 1 of 1): No coercion defined from '"/home/younso/dockstore_test/cromwell-input/23968040-60bc-49a0-81a3-be2416675905/home/younso/scRNA/share-raw/sp.atac.R1.fastq.gz"' of type 'spray.json.JsString' to 'Array[File]'.
Failed to evaluate input 'read1_rna' (reason 1 of 1): No coercion defined from '"/home/younso/dockstore_test/cromwell-input/23968040-60bc-49a0-81a3-be2416675905/home/younso/scRNA/share-raw/sp.rna.R1.fastq.gz"' of type 'spray.json.JsString' to 'Array[File]'.
Failed to evaluate input 'save_plots_to_dir' (reason 1 of 1): No coercion defined from 'true' of type 'spray.json.JsTrue$' to 'String'.
Failed to evaluate input 'read1_atac' (reason 1 of 1): No coercion defined from '"/home/younso/dockstore_test/cromwell-input/23968040-60bc-49a0-81a3-be2416675905/home/younso/scRNA/share-raw/sp.atac.R2.fastq.gz"' of type 'spray.json.JsString' to 'Array[File]'.
at cromwell.engine.workflow.lifecycle.materialization.MaterializeWorkflowDescriptorActor.cromwell$engine$workflow$lifecycle$materialization$MaterializeWorkflowDescriptorActor$$workflowInitializationFailed(MaterializeWorkflowDescriptorActor.scala:257)
at cromwell.engine.workflow.lifecycle.materialization.MaterializeWorkflowDescriptorActor$$anonfun$2.applyOrElse(MaterializeWorkflowDescriptorActor.scala:227)
at cromwell.engine.workflow.lifecycle.materialization.MaterializeWorkflowDescriptorActor$$anonfun$2.applyOrElse(MaterializeWorkflowDescriptorActor.scala:222)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
at akka.actor.FSM.processEvent(FSM.scala:707)
at akka.actor.FSM.processEvent$(FSM.scala:704)
at cromwell.engine.workflow.lifecycle.materialization.MaterializeWorkflowDescriptorActor.akka$actor$LoggingFSM$$super$processEvent(MaterializeWorkflowDescriptorActor.scala:169)
at akka.actor.LoggingFSM.processEvent(FSM.scala:847)
at akka.actor.LoggingFSM.processEvent$(FSM.scala:829)
at cromwell.engine.workflow.lifecycle.materialization.MaterializeWorkflowDescriptorActor.processEvent(MaterializeWorkflowDescriptorActor.scala:169)
at akka.actor.FSM.akka$actor$FSM$$processMsg(FSM.scala:701)
at akka.actor.FSM$$anonfun$receive$1.applyOrElse(FSM.scala:695)
at akka.actor.Actor.aroundReceive(Actor.scala:539)
at akka.actor.Actor.aroundReceive$(Actor.scala:537)
at cromwell.engine.workflow.lifecycle.materialization.MaterializeWorkflowDescriptorActor.aroundReceive(MaterializeWorkflowDescriptorActor.scala:169)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:614)
at akka.actor.ActorCell.invoke(ActorCell.scala:583)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:268)
at akka.dispatch.Mailbox.run(Mailbox.scala:229)
at akka.dispatch.Mailbox.exec(Mailbox.scala:241)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

My command is below:

dockstore workflow launch --entry github.com/broadinstitute/epi-SHARE-seq-pipeline/SHARE-seq:release --json test_Dockstore.json

and the contents of test_Dockstore.json are:

{
"ShareSeq.gtf": "/home/younso/dockstore_test/hg19.genes.gtf",
"ShareSeq.remove_single_umi": false,
"ShareSeq.docker_image_dorcs": "/home/younso/bin/epi-SHARE-seq-pipeline/dockerfiles",
"ShareSeq.maxFeature_RNA": "2500",
"ShareSeq.read1_rna": "/home/younso/scRNA/share-raw/sp.rna.R1.fastq.gz",
"ShareSeq.read2_rna": "/home/younso/scRNA/share-raw/sp.rna.R2.fastq.gz",
"ShareSeq.corrPVal": "0.05",
"ShareSeq.cpus_atac": "20",
"ShareSeq.dorcs.chunkSize": "50000",
"ShareSeq.joint.min_genes": "200",
"ShareSeq.cpus_dorcs": "20",
"ShareSeq.read2_atac": "/home/younso/scRNA/share-raw/sp.atac.R1.fastq.gz",
"ShareSeq.cpus": "20",
"ShareSeq.topNGene": "30",
"ShareSeq.minCells_RNA": "3",
"ShareSeq.idx_tar_atac": "/home/younso/dockstore_test/hg19.tar.gz",
"ShareSeq.joint.min_umis": "50",
"ShareSeq.include_multimappers": false,
"ShareSeq.rna.umap_dim": "10",
"ShareSeq.joint.mem_gb": "64",
"ShareSeq.cutoff_rna": "100",
"ShareSeq.minFeature_RNA": "200",
"ShareSeq.multimappers": false,
"ShareSeq.qc": false,
"ShareSeq.idx_tar_rna": "/home/younso/dockstore_test/hg19.tar.gz",
"ShareSeq.joint.min_frags": "100",
"ShareSeq.genome_name_input": "hg19",
"ShareSeq.percentMT_RNA": "5",
"ShareSeq.dorcs.numBackgroundPairs": "100000",
"ShareSeq.prefix": "shareseq-project",
"ShareSeq.save_plots_to_dir": TRUE,
"ShareSeq.joint.disk_gb": "50",
"ShareSeq.rna.umap_resolution": "0.5",
"ShareSeq.windowPadSize": "50000",
"ShareSeq.dorcs.disk_gb": "100",
"ShareSeq.fripCutOff": "0.3",
"ShareSeq.cutoff_atac": "100",
"ShareSeq.mem_gb_dorcs": "64",
"ShareSeq.dorcs.numNearestNeighbor": "30",
"ShareSeq.joint.umi_metrics_cutoff": "10",
"ShareSeq.dorcGeneCutOff": "10",
"ShareSeq.mode": "fast",
"ShareSeq.read1_atac": "/home/younso/scRNA/share-raw/sp.atac.R2.fastq.gz",
"ShareSeq.cpus_rna": "20",
"ShareSeq.joint.min_tss": "4",
"ShareSeq.include_introns": "true",
"ShareSeq.gene_naming": "gene_name"
}

test_Dockstore.json contains most of the contents of Dockstore.json, except for some fields that I did not know how to fill in.
As I inferred, it seems that more than two paths should be entered in ShareSeq.read1_rna, but the example file contains only one path and there is only one pair of files.
Do I have to put both reads in read1 without caring about read1 and read2?
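
Judging from the coercion errors above, the read inputs are declared as Array[File] and save_plots_to_dir as String, so a likely fix is to wrap each FASTQ path in a JSON array and quote the boolean (a sketch against the reported inputs, not a verified schema; the other three read inputs need the same array wrapping):

"ShareSeq.read1_rna": ["/home/younso/scRNA/share-raw/sp.rna.R1.fastq.gz"],
"ShareSeq.save_plots_to_dir": "true",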
