Broad Institute of MIT and Harvard Single-Cell/Nucleus Multiomic Processing Pipeline

Epigenomics Program pipeline to analyze SHARE-seq data.

License: MIT License

Pipeline specifications can be found here.

Pipeline main page on Dockstore.

Pipeline overview.

Structure of this repo

  • The tasks directory contains the tasks called from the main workflow share-seq.wdl. Each task corresponds to a different step of the pipeline: align, filter, etc.
  • The src directory contains bash, Python, R, and notebook scripts called within the tasks.
  • The dockerfiles directory contains the Dockerfiles used to build the pipeline's Docker images.
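
For orientation, the layout looks roughly like this (an illustrative sketch based on the description above; see the repository itself for the authoritative structure):

    epi-SHARE-seq-pipeline/
    ├── share-seq.wdl    # main workflow
    ├── tasks/           # one WDL task per pipeline step (align, filter, ...)
    ├── src/             # bash, Python, R, and notebook scripts called by the tasks
    └── dockerfiles/     # Dockerfiles for the pipeline's Docker images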

Introduction

The SHARE-seq multiomic pipeline is based on the original Buenrostro SHARE-seq pipeline specifications (by Sai Ma) in this GitHub repo.

This 10X single-cell multiomic pipeline is based on the ENCODE (phase-3) single-cell pipeline specifications (by Anshul Kundaje) in this Google Doc.

Features

  • Portability: The pipeline can be run on different cloud platforms such as Google Cloud, AWS, and DNAnexus, as well as on cluster engines such as SLURM, SGE, and PBS.
  • User-friendly HTML report: In addition to the standard outputs, the pipeline generates an HTML report with quality metrics, including alignment statistics, along with many useful plots. An example of the HTML report. # TODO: add an example html.
  • Supported genomes: The pipeline requires genome-specific data such as aligner indices, chromosome sizes, and blacklisted regions. We provide genome references for hg38, mm10, and mm39.

Installation

  1. Install Caper (Python Wrapper/CLI for Cromwell).

    $ pip install caper
  2. IMPORTANT: Read Caper's README carefully to choose a backend for your system, then follow the commented instructions in the configuration file (an illustrative SLURM sketch appears after these steps).

    # backend: local or your HPC type (e.g. slurm, sge, pbs, lsf). read Caper's README carefully.
    $ caper init [YOUR_BACKEND]
    
    # IMPORTANT: edit the conf file and follow commented instructions in there
    $ vi ~/.caper/default.conf
  3. Git clone this pipeline.

    $ cd
    $ git clone https://github.com/broadinstitute/epi-SHARE-seq-pipeline/ #TODO: This should point to the release
  4. Define test input JSON.

    INPUT_JSON="" #TODO: We need a test dataset available for everyone
  5. If you have Docker and want to run the pipeline locally on your laptop, --max-concurrent-tasks 1 limits the number of concurrent tasks for a test run. Omit it when running on a workstation/HPC.

    # check if Docker works on your machine
    $ docker run ubuntu:latest echo hello
    
    # --max-concurrent-tasks 1 is for computers with limited resources
    $ caper run share-seq.wdl -i "${INPUT_JSON}" --docker --max-concurrent-tasks 1
  6. Otherwise, install Singularity on your system. Please follow these instructions to install Singularity on a Debian-based OS. Or ask your system administrator to install Singularity on your HPC.

    # check if Singularity works on your machine
    $ singularity exec docker://ubuntu:latest echo hello
    
    # on your local machine (--max-concurrent-tasks 1 is for computers with limited resources)
    $ caper run share-seq.wdl -i "${INPUT_JSON}" --singularity --max-concurrent-tasks 1
    
    # on HPC, make sure that Caper's conf ~/.caper/default.conf is correctly configured to work with your HPC
    # the following command will submit Caper as a leader job to SLURM with Singularity
    $ caper hpc submit share-seq.wdl -i "${INPUT_JSON}" --singularity --leader-job-name ANY_GOOD_LEADER_JOB_NAME
    
    # check job ID and status of your leader jobs
    $ caper hpc list
    
    # cancel the leader job to terminate all of its child jobs
    # if you directly use a cluster command such as scancel or qdel,
    # the child jobs will not be terminated
    $ caper hpc abort [JOB_ID]
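
The conf file generated by caper init contains the authoritative, commented instructions. As an illustrative sketch only (key names may differ across Caper versions), a SLURM setup might look like:

    # ~/.caper/default.conf -- illustrative sketch for a SLURM cluster
    backend=slurm

    # partition/account to submit jobs to; the values here are placeholders
    slurm-partition=YOUR_PARTITION
    slurm-account=YOUR_ACCOUNT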

Input JSON file

IMPORTANT: DO NOT BLINDLY USE A TEMPLATE/EXAMPLE INPUT JSON. READ THROUGH THE FOLLOWING GUIDE TO MAKE A CORRECT INPUT JSON FILE.

An input JSON file specifies all of the input parameters and files that are necessary for successfully running this pipeline. This includes a specification of the path to the genome reference files and the raw data FASTQ files. Please make sure to specify absolute paths rather than relative paths in your input JSON files.

  1. Input JSON file specification (short)
  2. Input JSON file specification (long)
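
As a hypothetical minimal sketch only (key names are illustrative and may not match the current schema; the specification documents above are authoritative), an input JSON might look like:

{
    "ShareSeq.prefix": "my-experiment",
    "ShareSeq.genome_name_input": "hg38",
    "ShareSeq.idx_tar_atac": "/absolute/path/to/atac_index.tar.gz",
    "ShareSeq.idx_tar_rna": "/absolute/path/to/rna_index.tar.gz",
    "ShareSeq.read1_atac": ["/absolute/path/to/sample.atac.R1.fastq.gz"],
    "ShareSeq.read2_atac": ["/absolute/path/to/sample.atac.R2.fastq.gz"],
    "ShareSeq.read1_rna": ["/absolute/path/to/sample.rna.R1.fastq.gz"],
    "ShareSeq.read2_rna": ["/absolute/path/to/sample.rna.R2.fastq.gz"]
}

Note that the read inputs are JSON arrays: the workflow declares them as Array[File], so even a single FASTQ must be wrapped in a list.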

Running on Terra/AnVIL (using Dockstore)

Visit our pipeline repo on Dockstore. Click on Terra or AnVIL. Follow Terra's instructions to create a workspace on Terra and add Terra's billing bot to your Google Cloud account.

A test input JSON for Terra is not available at the moment; once one is published, download it, upload it to Terra's UI, and then run the analysis.

If you would like to use your own input JSON file, make sure that all files in the input JSON are on a Google Cloud Storage bucket (gs://). URLs will not work.
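
For example, local files can be staged to a bucket you control with gsutil (the bucket name below is a placeholder):

$ gsutil cp /absolute/path/to/sample.rna.R1.fastq.gz gs://YOUR_BUCKET/share-seq/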

How to organize outputs

Install Croo. Make sure that you have Python 3 (> 3.4.1) installed on your system. Find the metadata.json file in Caper's output directory.

$ pip install croo
$ croo [METADATA_JSON_FILE]
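
If you prefer to organize outputs into a specific directory, croo also accepts an output location (a sketch, assuming croo's --out-dir option; check croo --help for your installed version):

$ croo [METADATA_JSON_FILE] --out-dir /path/to/organized/output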

How to make a spreadsheet of QC metrics

Install qc2tsv. Make sure that you have Python 3 (>3.4.1) installed on your system.

Once you have organized the output with Croo, you will be able to find the pipeline's final output file qc/qc.json, which contains all the QC metrics. Simply feed qc2tsv multiple qc.json files; it can take various URIs such as local paths, gs://, and s3://.

$ pip install qc2tsv
$ qc2tsv /sample1/qc.json gs://sample2/qc.json s3://sample3/qc.json ... > spreadsheet.tsv

QC metrics for each experiment (qc.json) will be split into multiple rows (one for the overall experiment plus one for each biological replicate) in the spreadsheet.


TODO: Sambamba; add track generation.

Thank you to the ENCODE DAC for writing excellent documentation for their pipelines that we used as templates.

Contributors

emattei, kdong2395, lzj1769, mei-knudson, nchernia, sidwekhande


epi-share-seq-pipeline's Issues

pretty plate QC

We should incorporate Surya's code for the plate QC into demux.

no way to force exit jupyter nb execution via papermill

Cannot find a way to force exit jupyter notebook running IRkernel via papermill. This mainly affects the DORCs correlation function, which goes into an infinite loop if there are no common barcodes between RNA and ATAC matrices.

  • papermill ignores R's quit() and stop() commands
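
A possible stopgap, not a fix for the underlying IRkernel behavior, is papermill's per-cell timeout, which aborts any cell that runs longer than the given number of seconds (notebook names below are placeholders):

    # workaround sketch: fail the run instead of looping forever
    $ papermill --execution-timeout 600 dorcs_notebook.ipynb dorcs_notebook.out.ipynb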

Ambient RNA

Mei checked CellBender and it seems that it did not affect the results. Check with Jason to learn more about this.

WorkflowManagerActor: Workflow f5905e43-a5fe-423c-be74-3afbfc0d5540 failed

When running the pipeline via Dockstore, I got this error message:

[info] WorkflowManagerActor: Workflow f5905e43-a5fe-423c-be74-3afbfc0d5540 failed (during MaterializingWorkflowDescriptorState): cromwell.engine.workflow.lifecycle.materialization.MaterializeWorkflowDescriptorActor$$anon$1: Workflow input processing failed:
Failed to evaluate input 'read2_atac' (reason 1 of 1): No coercion defined from '"/home/younso/dockstore_test/cromwell-input/23968040-60bc-49a0-81a3-be2416675905/home/younso/scRNA/share-raw/sp.atac.R1.fastq.gz"' of type 'spray.json.JsString' to 'Array[File]'.
Failed to evaluate input 'read1_rna' (reason 1 of 1): No coercion defined from '"/home/younso/dockstore_test/cromwell-input/23968040-60bc-49a0-81a3-be2416675905/home/younso/scRNA/share-raw/sp.rna.R1.fastq.gz"' of type 'spray.json.JsString' to 'Array[File]'.
Failed to evaluate input 'save_plots_to_dir' (reason 1 of 1): No coercion defined from 'true' of type 'spray.json.JsTrue$' to 'String'.
Failed to evaluate input 'read1_atac' (reason 1 of 1): No coercion defined from '"/home/younso/dockstore_test/cromwell-input/23968040-60bc-49a0-81a3-be2416675905/home/younso/scRNA/share-raw/sp.atac.R2.fastq.gz"' of type 'spray.json.JsString' to 'Array[File]'.
at cromwell.engine.workflow.lifecycle.materialization.MaterializeWorkflowDescriptorActor.cromwell$engine$workflow$lifecycle$materialization$MaterializeWorkflowDescriptorActor$$workflowInitializationFailed(MaterializeWorkflowDescriptorActor.scala:257)
at cromwell.engine.workflow.lifecycle.materialization.MaterializeWorkflowDescriptorActor$$anonfun$2.applyOrElse(MaterializeWorkflowDescriptorActor.scala:227)
at cromwell.engine.workflow.lifecycle.materialization.MaterializeWorkflowDescriptorActor$$anonfun$2.applyOrElse(MaterializeWorkflowDescriptorActor.scala:222)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
at akka.actor.FSM.processEvent(FSM.scala:707)
at akka.actor.FSM.processEvent$(FSM.scala:704)
at cromwell.engine.workflow.lifecycle.materialization.MaterializeWorkflowDescriptorActor.akka$actor$LoggingFSM$$super$processEvent(MaterializeWorkflowDescriptorActor.scala:169)
at akka.actor.LoggingFSM.processEvent(FSM.scala:847)
at akka.actor.LoggingFSM.processEvent$(FSM.scala:829)
at cromwell.engine.workflow.lifecycle.materialization.MaterializeWorkflowDescriptorActor.processEvent(MaterializeWorkflowDescriptorActor.scala:169)
at akka.actor.FSM.akka$actor$FSM$$processMsg(FSM.scala:701)
at akka.actor.FSM$$anonfun$receive$1.applyOrElse(FSM.scala:695)
at akka.actor.Actor.aroundReceive(Actor.scala:539)
at akka.actor.Actor.aroundReceive$(Actor.scala:537)
at cromwell.engine.workflow.lifecycle.materialization.MaterializeWorkflowDescriptorActor.aroundReceive(MaterializeWorkflowDescriptorActor.scala:169)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:614)
at akka.actor.ActorCell.invoke(ActorCell.scala:583)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:268)
at akka.dispatch.Mailbox.run(Mailbox.scala:229)
at akka.dispatch.Mailbox.exec(Mailbox.scala:241)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

My command is below:

dockstore workflow launch --entry github.com/broadinstitute/epi-SHARE-seq-pipeline/SHARE-seq:release --json test_Dockstore.json

and the contents of test_Dockstore.json are:

{
"ShareSeq.gtf": "/home/younso/dockstore_test/hg19.genes.gtf",
"ShareSeq.remove_single_umi": false,
"ShareSeq.docker_image_dorcs": "/home/younso/bin/epi-SHARE-seq-pipeline/dockerfiles",
"ShareSeq.maxFeature_RNA": "2500",
"ShareSeq.read1_rna": "/home/younso/scRNA/share-raw/sp.rna.R1.fastq.gz",
"ShareSeq.read2_rna": "/home/younso/scRNA/share-raw/sp.rna.R2.fastq.gz",
"ShareSeq.corrPVal": "0.05",
"ShareSeq.cpus_atac": "20",
"ShareSeq.dorcs.chunkSize": "50000",
"ShareSeq.joint.min_genes": "200",
"ShareSeq.cpus_dorcs": "20",
"ShareSeq.read2_atac": "/home/younso/scRNA/share-raw/sp.atac.R1.fastq.gz",
"ShareSeq.cpus": "20",
"ShareSeq.topNGene": "30",
"ShareSeq.minCells_RNA": "3",
"ShareSeq.idx_tar_atac": "/home/younso/dockstore_test/hg19.tar.gz",
"ShareSeq.joint.min_umis": "50",
"ShareSeq.include_multimappers": false,
"ShareSeq.rna.umap_dim": "10",
"ShareSeq.joint.mem_gb": "64",
"ShareSeq.cutoff_rna": "100",
"ShareSeq.minFeature_RNA": "200",
"ShareSeq.multimappers": false,
"ShareSeq.qc": false,
"ShareSeq.idx_tar_rna": "/home/younso/dockstore_test/hg19.tar.gz",
"ShareSeq.joint.min_frags": "100",
"ShareSeq.genome_name_input": "hg19",
"ShareSeq.percentMT_RNA": "5",
"ShareSeq.dorcs.numBackgroundPairs": "100000",
"ShareSeq.prefix": "shareseq-project",
"ShareSeq.save_plots_to_dir": TRUE,
"ShareSeq.joint.disk_gb": "50",
"ShareSeq.rna.umap_resolution": "0.5",
"ShareSeq.windowPadSize": "50000",
"ShareSeq.dorcs.disk_gb": "100",
"ShareSeq.fripCutOff": "0.3",
"ShareSeq.cutoff_atac": "100",
"ShareSeq.mem_gb_dorcs": "64",
"ShareSeq.dorcs.numNearestNeighbor": "30",
"ShareSeq.joint.umi_metrics_cutoff": "10",
"ShareSeq.dorcGeneCutOff": "10",
"ShareSeq.mode": "fast",
"ShareSeq.read1_atac": "/home/younso/scRNA/share-raw/sp.atac.R2.fastq.gz",
"ShareSeq.cpus_rna": "20",
"ShareSeq.joint.min_tss": "4",
"ShareSeq.include_introns": "true",
"ShareSeq.gene_naming": "gene_name"
}

test_Dockstore.json contains most of the contents of Dockstore.json, except for some fields that I did not know how to fill in.
As I inferred, it seems that more than two paths should be entered in ShareSeq.read1_rna, but the example file contains only one path and there is only one pair of files.
Do I have to put both reads in read1 without caring about read1 and read2?
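
Judging from the coercion errors above, the read inputs are declared as Array[File] and save_plots_to_dir as String, so a likely fix is to wrap each FASTQ path in a JSON array and quote the boolean (a sketch against the reported inputs, not a verified schema; the other three read inputs need the same array wrapping):

"ShareSeq.read1_rna": ["/home/younso/scRNA/share-raw/sp.rna.R1.fastq.gz"],
"ShareSeq.save_plots_to_dir": "true",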
