
manninm / miktmcsnakemakepipeline


An attempt at a snakemake pipeline. Pyflow was forked from https://github.com/crazyhottom, but the Snakefile infrastructure and rule calling were inspired by https://github.com/snakemake-workflows

License: MIT License

Shell 0.78% Python 47.60% R 51.63%
htseq-count snakemake fastqs star pipeline ballgown stringtie slurm-cluster

miktmcsnakemakepipeline's Introduction

Citation

Much of this pipeline was inspired by https://github.com/snakemake-workflows and https://github.com/crazyhottomy. The fastq2json.py script was modified from the original by https://github.com/crazyhottomy, while the Snakefile and modularized rules were inspired by https://github.com/snakemake-workflows. All files in rules and scripts are my own work. If you use this pipeline, please cite Manninm/MiKTMCSnakemakePipeline.

How to use Pipeline

Most of the specifics of the pipeline can be handled in the config.yaml file. The Snakefile, rules and cluster.json SHOULD NOT BE EDITED BY HAND. If you absolutely need to edit cluster.json, I recommend https://jsoneditoronline.org/. Snakemake is very sensitive to syntax, and just saving a file in the wrong format can cause problems.

Download the pipeline from GitHub, or transfer it from my home directory on the 76 server, then unpack it into your working directory

tar -xvf MiKTMCSnakemakePipeline.tar.gz
mv -v MiKTMCSnakemakePipeline/* .
rm -r MiKTMCSnakemakePipeline/

Do a dry run to check outputs and rules

snakemake -npr -s Snakefile

Make DAG or Rulegraph

snakemake --forceall --rulegraph -s Snakefile | dot -Tpng > rulegraph.png
snakemake --forceall --rulegraph -s Snakefile | dot -Tpdf > rulegraph.pdf
snakemake --forceall --dag -s Snakefile | dot -Tpng > dag.png
snakemake --forceall --dag -s Snakefile | dot -Tpdf > dag.pdf

Run locally using 22 cores

snakemake -j 22 -s Snakefile

Run on Greatlakes and Slurm

FYI, the --flags used in the snakemake command call must each have a matching entry somewhere in cluster.json, whether under the default heading or the rule heading. If --ntasks-per-node is used in the command call but only a differently named key is present under your default/rule heading, snakemake will complain that "Wildcards have no attribute..."

snakemake -j 999 --cluster-config cluster.json --cluster 'sbatch --job-name {cluster.job-name} --ntasks-per-node {cluster.ntasks-per-node} --cpus-per-task {threads} --mem-per-cpu {cluster.mem-per-cpu} --partition {cluster.partition} --time {cluster.time} --mail-user {cluster.mail-user} --mail-type {cluster.mail-type} --error {cluster.error} --output {cluster.output}'
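The mismatch described above can be caught before submitting anything. The following is a minimal sketch, not part of the pipeline: it assumes the default heading in cluster.json is named `__default__` (the name Snakemake's cluster-config mechanism uses), and the rule name `star_align` and all config values are hypothetical placeholders.

```python
import re

# Matches placeholders such as {cluster.mem-per-cpu} in a --cluster string.
CLUSTER_FIELD = re.compile(r"\{cluster\.([^}]+)\}")


def missing_cluster_keys(cluster_cmd, cluster_config, default_section="__default__"):
    """Return {section: [keys used in cluster_cmd that the section cannot resolve]}.

    A key is resolvable for a section if it is defined in that section
    or in the default section.
    """
    wanted = set(CLUSTER_FIELD.findall(cluster_cmd))
    defaults = set(cluster_config.get(default_section, {}))
    report = {}
    for section, values in cluster_config.items():
        missing = sorted(wanted - defaults - set(values))
        if missing:
            report[section] = missing
    return report


# Hypothetical, trimmed-down stand-in for the parsed content of cluster.json.
example_config = {
    "__default__": {"job-name": "snake", "partition": "standard", "time": "01:00:00"},
    "star_align": {"mem-per-cpu": "8g"},
}
example_cmd = (
    "sbatch --job-name {cluster.job-name} --partition {cluster.partition} "
    "--time {cluster.time} --mem-per-cpu {cluster.mem-per-cpu} --cpus-per-task {threads}"
)

print(missing_cluster_keys(example_cmd, example_config))
```

Here `mem-per-cpu` is reported for the default section, meaning any rule without its own `mem-per-cpu` entry would fail. To check a real run, load cluster.json with `json.load` and pass the same string given to `--cluster`.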

The Workflow of Pipeline

The workflow is as seen below

CAUTION: Four or more samples must be included, or the PCA scripts will break. The pipeline expects paired-end reads. To my knowledge, it will not accommodate single-end reads.

The pipeline expects a directory layout like the example below.

RNAseqTutorial/
├── Sample_70160
│   ├── 70160_ATTACTCG-TATAGCCT_S1_L001_R1_001.fastq.gz
│   └── 70160_ATTACTCG-TATAGCCT_S1_L001_R2_001.fastq.gz
├── Sample_70161
│   ├── 70161_TCCGGAGA-ATAGAGGC_S2_L001_R1_001.fastq.gz
│   └── 70161_TCCGGAGA-ATAGAGGC_S2_L001_R2_001.fastq.gz
├── Sample_70162
│   ├── 70162_CGCTCATT-ATAGAGGC_S3_L001_R1_001.fastq.gz
│   └── 70162_CGCTCATT-ATAGAGGC_S3_L001_R2_001.fastq.gz
├── Sample_70166
│   ├── 70166_CTGAAGCT-ATAGAGGC_S7_L001_R1_001.fastq.gz
│   └── 70166_CTGAAGCT-ATAGAGGC_S7_L001_R2_001.fastq.gz
├── scripts
├── groups.txt
└── Snakefile
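The layout and the four-sample minimum can be checked before starting a run. This is a rough sketch rather than anything shipped with the pipeline: it assumes sample directories start with `Sample_` and that read files carry `_R1_` / `_R2_` tokens, as in the example tree. The function name `check_layout` is made up for illustration, and the demo builds a throwaway directory with shortened, hypothetical file names.

```python
import tempfile
from pathlib import Path


def check_layout(root, min_samples=4):
    """Return a list of problems found under `root`; an empty list means OK."""
    problems = []
    sample_dirs = sorted(p for p in Path(root).iterdir()
                         if p.is_dir() and p.name.startswith("Sample_"))
    if len(sample_dirs) < min_samples:
        problems.append(
            f"found {len(sample_dirs)} sample directories, need at least {min_samples}")
    for d in sample_dirs:
        names = [f.name for f in d.iterdir()
                 if f.name.endswith((".fastq.gz", ".fq.gz"))]
        # Mask the read token so that mates of a pair compare equal.
        r1 = {n.replace("_R1_", "_R?_") for n in names if "_R1_" in n}
        r2 = {n.replace("_R2_", "_R?_") for n in names if "_R2_" in n}
        if not r1:
            problems.append(f"{d.name}: no R1 FASTQ files")
        for unpaired in sorted(r1 ^ r2):
            problems.append(f"{d.name}: unpaired read file pattern {unpaired}")
    return problems


# Demo on a temporary directory with four paired-end samples.
with tempfile.TemporaryDirectory() as tmp:
    for sample in ("70160", "70161", "70162", "70166"):
        d = Path(tmp) / f"Sample_{sample}"
        d.mkdir()
        for read in ("R1", "R2"):
            (d / f"{sample}_S1_L001_{read}_001.fastq.gz").touch()
    ok_report = check_layout(tmp)
    # Remove one mate to simulate a single-end sample.
    (Path(tmp) / "Sample_70166" / "70166_S1_L001_R2_001.fastq.gz").unlink()
    bad_report = check_layout(tmp)

print(ok_report)
print(bad_report)
```

In the demo, the complete layout yields an empty report, while the sample with a missing R2 file is flagged as unpaired.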

The pipeline uses two types of annotation and feature calling for redundancy, in the event that one pipeline fails or gives 'wonky' results.

Upon initiating the Snakefile, the snakemake preamble will check fastq file extensions (our lab uses .fq.gz for brevity) and change any fastq.gz to fq.gz. The preamble will then generate a samples.json file using fastq2json.py. You should check samples.json and make sure it is correct, because the rest of the pipeline uses this file to create wildcards, which are the driving force behind snakemake.

If no group file (groups.txt) was provided, the preamble will generate one for you. This file is necessary to run ballgown as well as the PCA plots, and it should also be checked for errors. If you provide your own groups.txt, it should be in the format below

Directory       Samples Disease Batch
Sample_70160/   Sample_70160    Sample  Batch
Sample_70161/   Sample_70161    Sample  Batch
Sample_70162/   Sample_70162    Sample  Batch
Sample_70166/   Sample_70166    Sample  Batch

The directory and sample names should correspond and be in the same order as they appear in the directory. The Disease and Batch columns can be used to designate phenotype data and any batches you may have. If you have varying 'Disease' types, you can then use this file for differential expression and use the Batch column to correct for batch effects. The PCA plotting scripts will plot Disease types in different colors and different batches with different shapes.
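A hand-written groups.txt can be sanity-checked against the rules above. The sketch below is only an illustration under several assumptions: columns are whitespace-delimited with no spaces inside values, "the order as they appear in the directory" means sorted name order, and the Disease and Batch values in the example (`Control`, `Case`, `1`, `2`) are hypothetical.

```python
EXPECTED_HEADER = ["Directory", "Samples", "Disease", "Batch"]


def check_groups(text):
    """Return a list of problems in whitespace-delimited groups.txt content."""
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    if not lines or lines[0] != EXPECTED_HEADER:
        return [f"header should be: {' '.join(EXPECTED_HEADER)}"]
    problems = []
    rows = lines[1:]
    for number, fields in enumerate(rows, start=1):
        if len(fields) != len(EXPECTED_HEADER):
            problems.append(f"row {number}: expected 4 columns, got {len(fields)}")
        elif fields[0].rstrip("/") != fields[1]:
            problems.append(
                f"row {number}: directory {fields[0]} does not match sample {fields[1]}")
    # Assumption: directory order means sorted sample names.
    names = [fields[1] for fields in rows if len(fields) == len(EXPECTED_HEADER)]
    if names != sorted(names):
        problems.append("rows are not in sorted directory order")
    return problems


example = """Directory Samples Disease Batch
Sample_70160/ Sample_70160 Control 1
Sample_70161/ Sample_70161 Control 2
Sample_70162/ Sample_70162 Case 1
Sample_70166/ Sample_70166 Case 2
"""

print(check_groups(example))
```

An empty list means the content passed these checks; it says nothing about whether the phenotype labels themselves are right, so the file still needs a manual look.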

I have attempted to make this pipeline as streamlined and automatic as possible. It could incorporate differential expression, but I feel that the pipeline completes sufficient tasks for review before differential analysis. In the event that a cohort has Glom and Tub samples, it would be wise to run each separately in its own pipeline; adding another child directory would make the rules more difficult to code. If there are any plots, QC tools or metrics that you use in your personal analysis, those can be integrated upon request.
