
ccbr / pipeliner


An open-source and scalable solution to NGS analysis powered by the NIH's Biowulf cluster.

Python 75.43% Perl 5.24% JavaScript 1.89% Shell 0.55% CSS 0.38% R 13.91% HTML 1.81% Roff 0.76% Awk 0.03%
rna-seq whole-genome-sequencing whole-exome-sequencing single-cell-rna-seq mirna-seq quality-control chip-seq

pipeliner's Introduction

Welcome to Pipeliner - an open-source and scalable solution to NGS analysis powered by the NIH's Biowulf cluster.

Pipeliner provides access to a set of best-practices NGS pipelines developed, tested, and benchmarked by experts at CCBR and NCBR.

Questions or need help?

Please check out our FAQ or contact page for different ways of getting in touch with the team.

Pipeliner is CHANGING!!!

Please visit the CCBR GitHub page for new features and the release schedule.

pipeliner's People

Contributors

abdallahamr, ajeetmandal, dlwheeler, felloumi, jlac, joshuabhk, kopardev, mtandon09, nikhilbg, pajailwala, skchronicles, slsevilla, tovahmarkowitz, wong-nw


pipeliner's Issues

peak annotations

  • Clean up peak annotation outputs
  • Add UROPA to the peak annotation options
  • Remove ChIPseeker (?)

Ability to Dynamically Resolve CCR Buy-In Nodes

By default, jobs are submitted to the ccr,norm partitions even if a user does not have access to the ccr buy-in nodes. This creates extra work for the job scheduler and adds logging noise. HPC staff are also suspending the accounts of unauthorized users who submit jobs to both partitions.

To dynamically determine a user's buy-in information, we need to map their account sponsor's information to partition access info.

To find out a user's sponsor, we can query the Slurm account management database using sacctmgr. The following command returns the sponsor's name:
sacctmgr -rn list user | awk '{print $2}'

We can then map this information to the partitions' AllowAccounts information to determine which buy-in nodes the user has access to:
BUY_IN_NODES=$(
    ACCOUNT_SPONSOR=$(sacctmgr -rn list user | awk '{print $2}') &&
    scontrol show partitions |
        grep -i "$ACCOUNT_SPONSOR" -B1 |
        grep '^PartitionName' |
        cut -d '=' -f2 |
        grep -iv 'gpu' |
        tr '\n' ',' |
        sed 's/.$//'
)
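The pipeline above can be wrapped in a small function whose text-processing half is testable without a live cluster; the function name and the sample partition layout are illustrative, not part of the pipeline:

```shell
# Hypothetical helper: extract non-GPU buy-in partitions for a sponsor from
# `scontrol show partitions` output (passed in as text so it can be tested
# without a cluster connection).
get_buyin_partitions() {
    local sponsor="$1" partitions="$2"
    printf '%s\n' "$partitions" \
        | grep -i "$sponsor" -B1 \
        | grep '^PartitionName' \
        | cut -d '=' -f2 \
        | grep -iv 'gpu' \
        | tr '\n' ',' \
        | sed 's/.$//'
}

# Illustrative usage on the cluster (sponsor resolved exactly as above):
# BUY_IN_NODES=$(get_buyin_partitions \
#     "$(sacctmgr -rn list user | awk '{print $2}')" \
#     "$(scontrol show partitions)")
```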

fix QC issues with dedups

The current QC portion of the pipeline is designed to run to completion with Picard deduplication outputs; many QC metrics do not work with MACS2 deduplication.
Features that work with both: fingerprint plots, PCA, correlation heatmaps
Features that are MACS2-deduplication specific: NGSQC plots

miRSeq - miRDeep2

Features:
QC: cutadapt adapter trimming; fastqc pre- and post-adapter trimming; kraken; fastqscreen (standard seqs and short RNAs); multiqc
Alignment with mirdeep2: mapping/collapsing; 2 pass alignment (annotated and de novo)
Quantification and DEG: edgeR

Create test datasets

  • Create SE and PE datasets for
    • mm10
    • mm9
    • hg19
    • hg38
  • Decide on location of saving these test datasets

Documentation

Make wiki documentation on usage of the ChIP-seq QC pipeline.

handle replicates

Make chipseq.snakefile handle replicates: run replicate-dependent features only when there are replicates, and only on the subset of files that actually have replicates.
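One way to find that subset (a sketch assuming a plain two-column sample/group table; the file layout and function name are illustrative, not the pipeline's actual group-file format):

```shell
# List groups that have more than one replicate in a "sample group" table;
# chipseq.snakefile could then enable replicate-only rules for these groups.
groups_with_replicates() {
    awk '{ count[$2]++ } END { for (g in count) if (count[g] > 1) print g }' "$1" \
        | sort
}
```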

PCA generation in QC step.

PCA is currently generated in the second part of the pipeline. It should be moved over to the initialQC part.

GSEA

  • Run GSEA for all contrasts against:
    • c2:cp
    • c5
  • heatmap with significance (asterisk)
  • Add leading edge analysis
  • Remove L2P from LimmaReport, EdgerReport, DESeqReport

Covariates

  • Add new columns to groups folder to denote batch.

  • Modify DEG calculations and Rmd reports

RSEM CPM filtering (Master and module load)

We are currently transforming the CPM filter threshold, and by doing so we are making the filtering much less stringent: the 0.5 cutoff is divided by max(tot) and rescaled by 1e6 before being compared against values that are already in CPM.
val1 = 0.5
val1 = (val1 / max(tot)) * 1e6
filter <- apply(cpm(mydata), 1, function(x) length(x[x > val1]) >= val2)

val1
[1] 0.008896055
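The magnitude of the change is easy to verify; the library size below is illustrative, not the project's actual max(tot):

```shell
# Rescaling an already-CPM cutoff of 0.5 by an ~5.6e7-read library size
# shrinks it by roughly two orders of magnitude.
awk 'BEGIN { val1 = 0.5; max_tot = 56200000; printf "%.4f\n", (val1 / max_tot) * 1e6 }'
```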

GUI Warning Message: Whole Exome/Genome Pipelines


mm9 is not supported by the whole genome/exome pipelines. If you look at the mm9.json file, it is actually using mm10 reference files:

{
    "references": {
        "exomeseq": {
            "BWAGENOME": "/fdb/igenomes/Mus_musculus/UCSC/mm10/Sequence/BWAIndex/genome.fa",
            "PON": "/data/CCBR_Pipeliner/db/PipeDB/lib/mm10_dbSNP_allStrains_compSet.vcf.gz",
            "GENOME": "/fdb/igenomes/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa",
            "CHROMS": ["chr1","chr2","chr3","chr4","chr5","chr6","chr7","chr8","chr9","chr10","chr11","chr12","chr13","chr14","chr15","chr16","chr17","chr18","chr19","chrX","chrY","chrM"],
            "INDELSITES": "/data/CCBR_Pipeliner/db/PipeDB/lib/mm10_known_indels.vcf.gz",
            "KNOWNINDELS": "-known /data/CCBR_Pipeliner/db/PipeDB/lib/mm10_known_indels.vcf.gz",
            "KNOWNRECAL": "-knownSites /data/CCBR_Pipeliner/db/PipeDB/lib/mm10_known_indels.vcf.gz -knownSites /data/CCBR_Pipeliner/db/PipeDB/lib/mm10_known_snps.vcf.gz",
            "KNOWNANCESTRY": "/data/CCBR_Pipeliner/db/PipeDB/lib/mm10_known_snps.vcf.gz",
            "KNOWNVCF": "/data/CCBR_Pipeliner/db/PipeDB/lib/mm10_known_snps.vcf.gz",
            "KNOWNVCF2": "/data/CCBR_Pipeliner/db/PipeDB/lib/mm10_known_snps.vcf.gz",
            "NOVOINDEX": "/data/CCBR_Pipeliner/db/PipeDB/lib/mm10_genome.nix",


"genomeseq": {
            "BWAGENOME": "/fdb/igenomes/Mus_musculus/UCSC/mm10/Sequence/BWAIndex/genome.fa",
            "GENOME": "/fdb/igenomes/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa",
            "CHROMS": "/data/CCBR_Pipeliner/db/PipeDB/lib/mm10_chromosomes",
            "INDELSITES": "/data/CCBR_Pipeliner/db/PipeDB/lib/mm10_known_indels.vcf.gz",
            "KNOWNINDELS": "-known /data/CCBR_Pipeliner/db/PipeDB/lib/mm10_known_indels.vcf.gz",
            "KNOWNRECAL": "-knownSites /data/CCBR_Pipeliner/db/PipeDB/lib/mm10_known_indels.vcf.gz -knownSites /data/CCBR_Pipeliner/db/PipeDB/lib/mm10_known_snps.vcf.gz",
            "KNOWNANCESTRY": "/data/CCBR_Pipeliner/db/PipeDB/lib/mm10_known_snps.vcf.gz",
            "KNOWNVCF": "/data/CCBR_Pipeliner/db/PipeDB/lib/mm10_known_snps.vcf.gz",
            "KNOWNVCF2": "/data/CCBR_Pipeliner/db/PipeDB/lib/mm10_known_snps.vcf.gz",
            "NOVOINDEX": "/data/CCBR_Pipeliner/db/PipeDB/lib/mm10_genome.nix",
            "CNVKITGENOME": "/data/CCBR_Pipeliner/db/PipeDB/lib/genome.fa",
            "SNPSITES": "/data/CCBR_Pipeliner/db/PipeDB/lib/mm10_allstrains_dbSNP142.vcf",

Solution: add a pop-up box warning users when they select an unsupported reference genome.

Redesign DEG folders

  • Need to add an extra column to "contrasts.tab", which will serve as a CPM cutoff threshold for the contrast represented on that line of the file.

  • all sample DEG folder separate (generated during initialQC)

  • DEG_RSEM_genes__<CPM_threshold> as the folder name of each contrast.
    -- filter each contrast separately
    -- create a separate PCA of the contrast-only groups

fix macs2 peak calling

Improve MACS2 peak calling to:
a) use the ppqt cross-correlation values for read extension for SE data
b) use PE data from bam files when applicable
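A sketch of how the two modes could be switched (the function name is illustrative; the MACS2 flags are standard options, but the ppqt fragment length and file names are placeholders):

```shell
# Choose MACS2 read-model flags: PE data uses real fragments (-f BAMPE),
# SE data disables the model and extends reads to the ppqt fragment length.
macs2_mode_flags() {
    if [ "$1" = "PE" ]; then
        echo "-f BAMPE"
    else
        echo "--nomodel --extsize $2"
    fi
}

# Illustrative usage (fragment length 200 taken from ppqt output):
# macs2 callpeak -t chip.Q5DD.bam -c input.Q5DD.bam -g mm \
#     $(macs2_mode_flags SE 200) -n chip_peaks
```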

improve fingerprint plot

Current fingerprint plots only produce a subset of available metrics. Pairing of ChIP and input samples will increase the information gained from these plots.

fix bam headers

The headers of the current Q5DD bam files have sampleID=sample.
Fix via Picard reheading.
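A possible sketch of the fix using samtools instead of Picard (the sample names and the SM:sample pattern are assumptions based on the description above):

```shell
# Rewrite SM:sample to the real sample ID in header text.
fix_sm_tag() {
    sed "s/SM:sample/SM:$1/g"
}

# Illustrative usage on a bam (not run here):
# samtools view -H WT_rep1.Q5DD.bam | fix_sm_tag WT_rep1 > new_header.sam
# samtools reheader new_header.sam WT_rep1.Q5DD.bam > WT_rep1.Q5DD.fixed.bam
```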

HPC snakemake upgrade incompatibility

Snakemake Version Incompatibility: REVERT to 5.1.3

Changes were made to the module load ccbrpipeliner, master, and activeDev branches.

HPC upgraded the version of snakemake that is loaded when running module load snakemake/3.5: previously snakemake 5.1.3 was loaded, now snakemake 5.3.0 is loaded.
This is problematic for our base command, which uses the -T switch that is no longer supported in the newer version.

Add explicit module load snakemake/5.1.3 statements after module loading python 3.5 to the following files:

  • submit_slurm.template
  • pipeline_ctrl.sh
  • slurm.template
  • ccbrpipe.sh

fingerprint plot and multiQC

Find a workaround for the fact that MultiQC currently finds two fingerprint plot files per sample (pre-dedup and post-dedup) and randomly chooses one.
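One possible workaround (a sketch; the file-naming pattern and directory layout are assumptions): stage only the post-dedup fingerprint files into the directory MultiQC scans, so there is exactly one file per sample.

```shell
# Copy only post-dedup fingerprint metrics into the MultiQC input directory.
stage_postdedup_fingerprints() {
    local src="$1" dest="$2"
    mkdir -p "$dest"
    for f in "$src"/*postdedup*fingerprint*; do
        [ -e "$f" ] && cp "$f" "$dest/"
    done
}
```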


QC table for ChIPSeq

Gather QC metrics into a single html report:

  • Nreads
  • Nmapped
  • Nuniquelymapped
  • NRF
  • PBC1
  • PBC2
  • FRiP
  • NGSQC 3 numbers
  • NSC
  • RSC
  • Qtag
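For reference, the complexity metrics in the list can be computed from per-position read counts; this awk sketch follows the usual ENCODE definitions (NRF = distinct positions / total reads, PBC1 = single-read positions / distinct positions, PBC2 = single-read / double-read positions), and the input format and function name are illustrative:

```shell
# Compute NRF/PBC1/PBC2 from input with one read count per distinct position.
compute_library_complexity() {
    awk '{ tot += $1; distinct++; if ($1 == 1) m1++; if ($1 == 2) m2++ }
         END { printf "NRF=%.3f PBC1=%.3f PBC2=%.3f\n",
               distinct / tot, m1 / distinct, m1 / m2 }'
}
```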
