
ccbr / pipeliner


An open-source and scalable solution to NGS analysis powered by the NIH's Biowulf cluster.

Python 75.43% Perl 5.24% JavaScript 1.89% Shell 0.55% CSS 0.38% R 13.91% HTML 1.81% Roff 0.76% Awk 0.03%
rna-seq whole-genome-sequencing whole-exome-sequencing single-cell-rna-seq mirna-seq quality-control chip-seq

pipeliner's Introduction

Welcome to Pipeliner - an open-source and scalable solution to NGS analysis powered by the NIH's Biowulf cluster.

Pipeliner provides access to a set of best-practices NGS pipelines developed, tested, and benchmarked by experts at CCBR and NCBR.

Questions or need help?

Please check out our FAQ or contact page for different ways of getting in touch with the team.

Pipeliner is CHANGING!!!

Please visit the CCBR GitHub page for new features and the release schedule.

pipeliner's People

Contributors

abdallahamr, ajeetmandal, dlwheeler, felloumi, jlac, joshuabhk, kopardev, mtandon09, nikhilbg, pajailwala, skchronicles, slsevilla, tovahmarkowitz, wong-nw


pipeliner's Issues

peak annotations

  • Clean up peak annotation outputs
  • Add UROPA to the peak annotation options
  • Remove ChIPseeker (?)

Ability to Dynamically Resolve CCR Buy-In Nodes

By default, jobs are submitted to the ccr,norm partitions even if a user does not have access to the ccr buy-in nodes. This creates extra work for the job scheduler and adds logging noise. HPC staff are also suspending the accounts of unauthorized users who submit jobs to both partitions.

To dynamically determine a user's buy-in information, we need to map their account sponsor's information to partition access info.

To find out a user's sponsor, we can query the Slurm account management database using sacctmgr. The following command returns the sponsor's name:
sacctmgr -rn list user | awk '{print $2}'

We can then map this information to the partitions' AllowAccounts information to determine which buy-in nodes the user has access to:
BUY_IN_NODES=$(
    ACCOUNT_SPONSOR=$(sacctmgr -rn list user | awk '{print $2}') &&
    scontrol show partitions |
        grep -i "$ACCOUNT_SPONSOR" -B1 |
        grep '^PartitionName' |
        cut -d '=' -f2 |
        grep -iv 'gpu' |
        tr '\n' ',' |
        sed 's/.$//'
)
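The pipeline above can be wrapped in a small function whose text-processing half is testable without a live cluster; the function name and the sample partition layout are illustrative, not part of the pipeline:

```shell
# Hypothetical helper: extract non-GPU buy-in partitions for a sponsor from
# `scontrol show partitions` output (passed in as text so it can be tested
# without a cluster connection).
get_buyin_partitions() {
    local sponsor="$1" partitions="$2"
    printf '%s\n' "$partitions" \
        | grep -i "$sponsor" -B1 \
        | grep '^PartitionName' \
        | cut -d '=' -f2 \
        | grep -iv 'gpu' \
        | tr '\n' ',' \
        | sed 's/.$//'
}

# Illustrative usage on the cluster (sponsor resolved exactly as above):
# BUY_IN_NODES=$(get_buyin_partitions \
#     "$(sacctmgr -rn list user | awk '{print $2}')" \
#     "$(scontrol show partitions)")
```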

fix QC issues with dedups

The current QC portion of the pipeline is designed to run to completion with Picard deduplication outputs; many QC metrics do not work with MACS2 deduplication.
Features that work with both: fingerprint plots, PCA, correlation heatmaps
Features that are MACS2-deduplication specific: NGSQC plots

miRSeq - miRDeep2

Features:
QC: cutadapt adapter trimming; fastqc pre- and post-adapter trimming; kraken; fastqscreen (standard seqs and short RNAs); multiqc
Alignment with mirdeep2: mapping/collapsing; 2 pass alignment (annotated and de novo)
Quantification and DEG: edgeR

Create test datasets

  • Create SE and PE datasets for
    • mm10
    • mm9
    • hg19
    • hg38
  • Decide on location of saving these test datasets

Documentation

Make wiki documentation on usage of the ChIP-seq QC pipeline.

handle replicates

Make chipseq.snakefile handle replicates: run replicate-dependent features only when there are replicates, and only on the subset of files that actually have replicates.
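One way to find that subset (a sketch assuming a plain two-column sample/group table; the file layout and function name are illustrative, not the pipeline's actual group-file format):

```shell
# List groups that have more than one replicate in a "sample group" table;
# chipseq.snakefile could then enable replicate-only rules for these groups.
groups_with_replicates() {
    awk '{ count[$2]++ } END { for (g in count) if (count[g] > 1) print g }' "$1" \
        | sort
}
```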

PCA generation in QC step.

PCA is currently generated in the second part of the pipeline. It should be moved over to the initialQC part.

GSEA

  • Run GSEA for all contrasts against:
    • c2:cp
    • c5
  • heatmap with significance (asterisk)
  • Add leading edge analysis
  • Remove L2P from LimmaReport, EdgerReport, DESeqReport

Covariates

  • Add new columns to groups folder to denote batch.

  • Modify DEG calculations and Rmd reports

RSEM CPM filtering (Master and module load)

We are currently transforming the CPM filter threshold, and by doing so we are making the filtering much less stringent: the 0.5 cutoff is divided by max(tot) and rescaled by 1e6 before being compared against values that are already in CPM.
val1 = 0.5
val1 = (val1 / max(tot)) * 1e6
filter <- apply(cpm(mydata), 1, function(x) length(x[x > val1]) >= val2)

val1
[1] 0.008896055
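The magnitude of the change is easy to verify; the library size below is illustrative, not the project's actual max(tot):

```shell
# Rescaling an already-CPM cutoff of 0.5 by an ~5.6e7-read library size
# shrinks it by roughly two orders of magnitude.
awk 'BEGIN { val1 = 0.5; max_tot = 56200000; printf "%.4f\n", (val1 / max_tot) * 1e6 }'
```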

GUI Warning Message: Whole Exome/Genome Pipelines


mm9 is not supported by the whole genome/exome pipelines. If you look at the mm9.json file, it is actually using mm10 reference files:

{
    "references": {
        "exomeseq": {
            "BWAGENOME": "/fdb/igenomes/Mus_musculus/UCSC/mm10/Sequence/BWAIndex/genome.fa",
            "PON": "/data/CCBR_Pipeliner/db/PipeDB/lib/mm10_dbSNP_allStrains_compSet.vcf.gz",
            "GENOME": "/fdb/igenomes/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa",
            "CHROMS": ["chr1","chr2","chr3","chr4","chr5","chr6","chr7","chr8","chr9","chr10","chr11","chr12","chr13","chr14","chr15","chr16","chr17","chr18","chr19","chrX","chrY","chrM"],
            "INDELSITES": "/data/CCBR_Pipeliner/db/PipeDB/lib/mm10_known_indels.vcf.gz",
            "KNOWNINDELS": "-known /data/CCBR_Pipeliner/db/PipeDB/lib/mm10_known_indels.vcf.gz",
            "KNOWNRECAL": "-knownSites /data/CCBR_Pipeliner/db/PipeDB/lib/mm10_known_indels.vcf.gz -knownSites /data/CCBR_Pipeliner/db/PipeDB/lib/mm10_known_snps.vcf.gz",
            "KNOWNANCESTRY": "/data/CCBR_Pipeliner/db/PipeDB/lib/mm10_known_snps.vcf.gz",
            "KNOWNVCF": "/data/CCBR_Pipeliner/db/PipeDB/lib/mm10_known_snps.vcf.gz",
            "KNOWNVCF2": "/data/CCBR_Pipeliner/db/PipeDB/lib/mm10_known_snps.vcf.gz",
            "NOVOINDEX": "/data/CCBR_Pipeliner/db/PipeDB/lib/mm10_genome.nix",


"genomeseq": {
            "BWAGENOME": "/fdb/igenomes/Mus_musculus/UCSC/mm10/Sequence/BWAIndex/genome.fa",
            "GENOME": "/fdb/igenomes/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa",
            "CHROMS": "/data/CCBR_Pipeliner/db/PipeDB/lib/mm10_chromosomes",
            "INDELSITES": "/data/CCBR_Pipeliner/db/PipeDB/lib/mm10_known_indels.vcf.gz",
            "KNOWNINDELS": "-known /data/CCBR_Pipeliner/db/PipeDB/lib/mm10_known_indels.vcf.gz",
            "KNOWNRECAL": "-knownSites /data/CCBR_Pipeliner/db/PipeDB/lib/mm10_known_indels.vcf.gz -knownSites /data/CCBR_Pipeliner/db/PipeDB/lib/mm10_known_snps.vcf.gz",
            "KNOWNANCESTRY": "/data/CCBR_Pipeliner/db/PipeDB/lib/mm10_known_snps.vcf.gz",
            "KNOWNVCF": "/data/CCBR_Pipeliner/db/PipeDB/lib/mm10_known_snps.vcf.gz",
            "KNOWNVCF2": "/data/CCBR_Pipeliner/db/PipeDB/lib/mm10_known_snps.vcf.gz",
            "NOVOINDEX": "/data/CCBR_Pipeliner/db/PipeDB/lib/mm10_genome.nix",
            "CNVKITGENOME": "/data/CCBR_Pipeliner/db/PipeDB/lib/genome.fa",
            "SNPSITES": "/data/CCBR_Pipeliner/db/PipeDB/lib/mm10_allstrains_dbSNP142.vcf",

Solution: add a pop-up box warning users when they select an unsupported reference genome.

Redesign DEG folders

  • Need to add an extra column to "contrasts.tab", which will serve as a CPM cutoff threshold for the contrast represented on that line of the file.

  • all sample DEG folder separate (generated during initialQC)

  • DEG_RSEM_genes__<CPM_threshold> as the folder name of each contrast.
    -- filter each contrast separately
    -- create a separate PCA of the contrast-only groups

fix macs2 peak calling

Improve MACS2 peak calling to:
a) use the ppqt cross-correlation values for read extension for SE data
b) use PE data from bam files when applicable
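A sketch of how the two modes could be switched (the function name is illustrative; the MACS2 flags are standard options, but the ppqt fragment length and file names are placeholders):

```shell
# Choose MACS2 read-model flags: PE data uses real fragments (-f BAMPE),
# SE data disables the model and extends reads to the ppqt fragment length.
macs2_mode_flags() {
    if [ "$1" = "PE" ]; then
        echo "-f BAMPE"
    else
        echo "--nomodel --extsize $2"
    fi
}

# Illustrative usage (fragment length 200 taken from ppqt output):
# macs2 callpeak -t chip.Q5DD.bam -c input.Q5DD.bam -g mm \
#     $(macs2_mode_flags SE 200) -n chip_peaks
```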

improve fingerprint plot

Current fingerprint plots only produce a subset of available metrics. Pairing of ChIP and input samples will increase the information gained from these plots.

fix bam headers

The headers of the current Q5DD bam files have sampleID=sample.
Fix via Picard reheading.
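A possible sketch of the fix using samtools instead of Picard (the sample names and the SM:sample pattern are assumptions based on the description above):

```shell
# Rewrite SM:sample to the real sample ID in header text.
fix_sm_tag() {
    sed "s/SM:sample/SM:$1/g"
}

# Illustrative usage on a bam (not run here):
# samtools view -H WT_rep1.Q5DD.bam | fix_sm_tag WT_rep1 > new_header.sam
# samtools reheader new_header.sam WT_rep1.Q5DD.bam > WT_rep1.Q5DD.fixed.bam
```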

HPC snakemake upgrade incompatibility

Snakemake Version Incompatibility: REVERT to 5.1.3

Changes were made to the module load ccbrpipeliner, master, and activeDev branches.

HPC upgraded the version of snakemake that is loaded when running module load snakemake/3.5: previously snakemake 5.1.3 was loaded, now snakemake 5.3.0 is loaded.
This is problematic for our base command, which uses the -T switch that is no longer supported in the newer version.

Add explicit module load snakemake/5.1.3 statements after module loading python 3.5 to the following files:

  • submit_slurm.template
  • pipeline_ctrl.sh
  • slurm.template
  • ccbrpipe.sh

fingerprint plot and multiQC

Find a workaround for the fact that MultiQC currently finds two fingerprint plot files per sample (pre-dedup and post-dedup) and randomly chooses one.
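One possible workaround (a sketch; the file-naming pattern and directory layout are assumptions): stage only the post-dedup fingerprint files into the directory MultiQC scans, so there is exactly one file per sample.

```shell
# Copy only post-dedup fingerprint metrics into the MultiQC input directory.
stage_postdedup_fingerprints() {
    local src="$1" dest="$2"
    mkdir -p "$dest"
    for f in "$src"/*postdedup*fingerprint*; do
        [ -e "$f" ] && cp "$f" "$dest/"
    done
}
```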


QC table for ChIPSeq

Gather QC metrics into a single html report:

  • Nreads
  • Nmapped
  • Nuniquelymapped
  • NRF
  • PBC1
  • PBC2
  • FRiP
  • NGSQC 3 numbers
  • NSC
  • RSC
  • Qtag
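For reference, the complexity metrics in the list can be computed from per-position read counts; this awk sketch follows the usual ENCODE definitions (NRF = distinct positions / total reads, PBC1 = single-read positions / distinct positions, PBC2 = single-read / double-read positions), and the input format and function name are illustrative:

```shell
# Compute NRF/PBC1/PBC2 from input with one read count per distinct position.
compute_library_complexity() {
    awk '{ tot += $1; distinct++; if ($1 == 1) m1++; if ($1 == 2) m2++ }
         END { printf "NRF=%.3f PBC1=%.3f PBC2=%.3f\n",
               distinct / tot, m1 / distinct, m1 / m2 }'
}
```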
