Pipeline for Reference based Transcriptomics.
Please download PiReT from GitHub:
git clone https://github.com/mshakya/PyPiReT.git
Then cd into the PyPiReT directory and run the installer:
cd PyPiReT
./INSTALL.sh
PiReT uses bioinformatic tools, many of which are available in bioconda. To install PiReT, we provide a script, INSTALL.sh, that checks whether the required dependencies (including their versions) are installed and in your PATH, and installs any that are not found into directories within PiReT. sudo privileges are not needed for installation. A log of the installation can be found in install.log.
PiReT requires the following dependencies, all of which should be installed and in the PATH. Any that are missing will be installed by INSTALL.sh.
- Python >=v2.7.12
  - The pipeline is not compatible with Python v3.0 or higher.
- R >=v3.3.1
- conda v4.2.13
  - If conda is not installed, INSTALL.sh will download and install miniconda, a "mini" version of conda that installs only a handful of packages compared to anaconda.
- cpanm v1.7039, for installing perl packages.
- jellyfish (v2.2.6)
- samtools (v1.3.1)
- HISAT2 (v2.0.5)
- gffread (v0.9.6)
- featureCounts (v1.5.2)
- StringTie (v1.3.3b)
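As a rough illustration of the "installed and in the PATH" requirement above, the following sketch looks for each tool on the PATH. It is not part of PiReT and does not replace INSTALL.sh: the executable names in `REQUIRED_TOOLS` are assumptions based on each tool's usual command name, and the sketch does not check versions.

```python
from __future__ import print_function
import os

# Assumed executable names for the dependencies listed above.
REQUIRED_TOOLS = ["conda", "R", "cpanm", "jellyfish", "samtools",
                  "hisat2", "gffread", "featureCounts", "stringtie"]


def find_on_path(name, path=None):
    """Return the full path of an executable called `name`, or None."""
    if path is None:
        path = os.environ.get("PATH", "")
    for directory in path.split(os.pathsep):
        candidate = os.path.join(directory, name)
        # A match must be a regular file that the current user may execute.
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


def missing_tools(tools=REQUIRED_TOOLS, path=None):
    """Return the entries of `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if find_on_path(tool, path) is None]


if __name__ == "__main__":
    for tool in missing_tools():
        print("not found in PATH:", tool)
```

Running the file directly prints one line per tool that could not be located; an empty output means every assumed executable name was found.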
We have provided a test data set to check whether the installation was successful. fastq files can be found in tests/fastqs and the corresponding reference fasta files in tests/data. To run the test, from within the PyPiReT directory:
# if you are on a Linux system:
sh ./test_pipeline_linux.sh
This shell script automatically creates test_experimental_design.txt and runs the pipeline.
usage: runPiReT [-h] [-c CPU] -d WORKDIR -i INDEX_HISAT -e EXPDSN
                [-fp FASTA_PROK] [-gp GFF_PROK] [-fe FASTA_EUK] [-ge GFF_EUK]
                [-k {prokarya,eukarya,both}]
                [-m {EdgeR,Deseq2,ballgown,DeEdge,Degown,ballEdge,all}]
                [-p P_VALUE] [--scheduler] [--qsub]
Luigi based workflow for running RNASeq pipeline
optional arguments:
  -h, --help            show this help message and exit
  -c CPU                number of CPUs/threads to run per task. Here, task
                        refers to a processing step. For example, number of
                        CPUs specified here will be used for QC, HISAT index
                        and mapping steps. Since QC and mapping steps are run
                        for every sample, be aware that the total number of
                        CPUs needed are your number of samples times CPU
                        specified here. (default: 1)
  -k {prokarya,eukarya,both}
                        which kingdom to test, when eukarya or both is chosen,
                        it expects alternative splicing (default: prokarya)
  -m {EdgeR,Deseq2,ballgown,DeEdge,Degown,ballEdge,all}
                        Method to use for detecting differentially expressed
                        genes, Deseq2 requires 3 biological replicates and
                        ballgown only works for eukaryotes (default: ballEdge)
  -p P_VALUE            P-Value to for finding significantly
                        different, default is 0.001 (default: 0.001)
  --scheduler           when specified, will use luigi scheduler which allows
                        you to keep track of task using an url specified
                        through luigid (default: True)
  --qsub                run the SGE version of the code, it currently is set
                        to SGE with smp (default: False)
required arguments:
  -d WORKDIR            working directory where all output files will be
                        processed and written (default: None)
  -i INDEX_HISAT        hisat2 index file, it only creates index if it does
                        not exist
  -e EXPDSN             tab delimited experimental design file
required arguments (for prokaryotes):
  -fp FASTA_PROK        fasta for Prokaryotic Reference (default: None)
  -gp GFF_PROK          path to gff files for prokaryotic organism (default: )
required arguments (for eukaryotes):
  -fe FASTA_EUK         fasta for Eukaryotic Reference (default: None)
  -ge GFF_EUK           path to gff files for eukaryotic organism (default: )
When running with both kingdoms, the options required for eukaryote runs and those required for prokaryote runs must all be given.
Example run for Prokaryotes RNA seq:
runPiReT -d <workdir> -e <design file> -gp <gff> -i <hisat2 index> -k prokarya -m <EdgeR/Deseq2> -fp <FASTA>
Example run for Eukaryotes RNA seq:
runPiReT -d <workdir> -e <design file> -ge <gff> -i <hisat2 index> -k eukarya -m <EdgeR/Deseq2> -fe <FASTA>
Example run for Both (Eukaryotes and Prokaryotes) RNA seq:
runPiReT -d <workdir> -e <design file> -gp <gff> -ge <gff> -i <hisat2 index> -k both -m <EdgeR/Deseq2> -fe <FASTA> -fp <FASTA>
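The relationship between -k and the kingdom-specific options can be sketched as a small helper that assembles a runPiReT argument list. This is only an illustration: `build_command` and `peak_cpus` are hypothetical names, not part of PiReT, and the sketch uses only the flags shown in the help text above. `peak_cpus` restates the note under -c that the total CPU need is the number of samples times the per-task value.

```python
def build_command(workdir, expdsn, index, kingdom="prokarya",
                  method="ballEdge", cpu=1, fasta_prok=None, gff_prok=None,
                  fasta_euk=None, gff_euk=None):
    """Assemble a runPiReT argument list, enforcing the per-kingdom options."""
    if kingdom not in ("prokarya", "eukarya", "both"):
        raise ValueError("unknown kingdom: {0}".format(kingdom))
    cmd = ["runPiReT", "-d", workdir, "-e", expdsn, "-i", index,
           "-k", kingdom, "-m", method, "-c", str(cpu)]
    # Prokaryote runs (and "both") need the prokaryotic FASTA and GFF.
    if kingdom in ("prokarya", "both"):
        if not (fasta_prok and gff_prok):
            raise ValueError("-fp and -gp are required for prokaryote runs")
        cmd += ["-fp", fasta_prok, "-gp", gff_prok]
    # Eukaryote runs (and "both") need the eukaryotic FASTA and GFF.
    if kingdom in ("eukarya", "both"):
        if not (fasta_euk and gff_euk):
            raise ValueError("-fe and -ge are required for eukaryote runs")
        cmd += ["-fe", fasta_euk, "-ge", gff_euk]
    return cmd


def peak_cpus(n_samples, cpu_per_task):
    """CPUs needed when every sample runs a per-sample task at once."""
    return n_samples * cpu_per_task
```

For instance, `peak_cpus(6, 4)` gives 24, so six samples with -c 4 may need up to 24 CPUs during QC and mapping. The returned list could be passed to `subprocess.call`, with all paths supplied by the user.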
An experimental design file consists of the sample name (ID), the full path to the fastq files (Files), and the group each sample belongs to (Group). We recommend that you use a text editor like BBEdit or TextWrangler to generate the tab delimited experimental design file. Exporting a tab delimited file directly from Excel tends to cause formatting problems. If possible, please avoid any special characters in sample names and group names.
For example:
samp1, samp_1 : good name
samp 1, samp.1: not a good name and will likely cause errors.
A sample experimental design file can be found here.
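As an alternative to hand-editing, a tab delimited design file can be generated from a script. The sketch below is a minimal example under several assumptions: the header row is assumed to be the three column names ID, Files and Group in that order, the "safe name" rule (letters, digits and underscores only) is inferred from the good and bad examples above, and the Files value is written unchanged because the text does not say how multiple fastq files are listed. Compare the result against the sample design file before using it.

```python
from __future__ import print_function
import re

# Assumed rule: only letters, digits and underscores count as safe.
SAFE_NAME = re.compile(r"^[A-Za-z0-9_]+\Z")


def is_safe_name(name):
    """True if `name` avoids spaces, dots and other special characters."""
    return bool(SAFE_NAME.match(name))


def design_lines(samples):
    """Turn (sample_id, files, group) tuples into tab delimited lines."""
    lines = ["\t".join(["ID", "Files", "Group"])]  # assumed header row
    for sample_id, files, group in samples:
        for name in (sample_id, group):
            if not is_safe_name(name):
                raise ValueError("unsafe name: {0!r}".format(name))
        lines.append("\t".join([sample_id, files, group]))
    return lines


def write_design(path, samples):
    """Write the design file to `path`, one sample per line."""
    with open(path, "w") as handle:
        handle.write("\n".join(design_lines(samples)) + "\n")
```

Here `is_safe_name("samp_1")` is true, while `is_safe_name("samp 1")` and `is_safe_name("samp.1")` are false, matching the examples above.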
All the outputs will be within the working directory.
- samp2: The name of this directory corresponds to the sample name. Within this folder there are two sub-folders:
  - mapping_results: This folder contains reads mapped using HISAT2 in the following formats. If splice_sites_gff.txt is present, HISAT2 aligns based on known splice sites.
    - *.sam: outputs of HISAT2
    - *.bam: generated from .sam
    - mapping.log: Alignment summary file from HISAT2.
    - *sTie.tab: Tab delimited file with Coverage, FPKM, TPM, for all the genes and novel transcripts. Generated using StringTie.
    - *sTie.gtf: Primary GTF formatted output of StringTie.
  - trimming_results: This folder contains results of quality trimming and filtering using FaQC.
    - *_qc_report.pdf: A QC report file with figures.
    - fastqCount.txt: A text file with summary of read counts.
    - *trimmed.fastq: Pair of trimmed fastq files.
    - *unpaired.trimmed.fastq: fastq that did not have pairs after QC.
    - *.stats.txt: Summary file with numbers of reads before and after QC.
- ballgown: ballgown folder, to be read by the R package ballgown for finding significantly expressed genes.
- *merged_transcript.gtf: Non-redundant list of transcripts in GTF format merged from all samples.
- featureCounts: A folder containing tables of counts from featureCounts.
  - CDS.count: Reads mapped to regions annotated as CDS.
  - CDS.count.summary: Summary of reads mapped and unmapped to CDS.
  - exon.count
  - exon.count.summary
  - prok_CDS.count: When the both option is used, prokaryote counts are in this file. Eukaryote counts are found in a file named euk_CDS.count.
  - prok_CDS.count.summary: Corresponding summary file.
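The per-sample layout described above can be checked after a run with a short script. The following is a sketch only, with hypothetical helper names; it assumes each sample directory sits directly under the working directory and contains the two sub-folders named above.

```python
import os

# Sub-folders expected inside each sample directory, per the list above.
SAMPLE_SUBFOLDERS = ("mapping_results", "trimming_results")


def missing_sample_folders(workdir, sample):
    """Return the expected sub-folders that are absent for one sample."""
    sample_dir = os.path.join(workdir, sample)
    return [name for name in SAMPLE_SUBFOLDERS
            if not os.path.isdir(os.path.join(sample_dir, name))]


def find_by_suffix(folder, suffix):
    """List file names in `folder` that end with `suffix`, sorted."""
    if not os.path.isdir(folder):
        return []
    return sorted(f for f in os.listdir(folder) if f.endswith(suffix))
```

For example, `find_by_suffix(os.path.join(workdir, "samp2", "mapping_results"), ".bam")` would list the BAM files produced for samp2, and an empty list from `missing_sample_folders` suggests both processing steps wrote their output folders.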
To uninstall: since all dependencies that were not already on your system are installed within the PiReT folder, deleting that folder (rm -rf) is sufficient to remove the package. Before removing it, check whether any of your project files are within the PiReT directory.
- Migun Shakya
- Shihai Feng
If you use PiReT, please cite the following papers:
- samtools: Li H., Handsaker B., Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. and 1000 Genome Project Data Processing Subgroup (2009) The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics, 25, 2078-9. [PMID: 19505943]
- bowtie2: Langmead, B., & Salzberg, S. L. (2012). Fast gapped-read alignment with Bowtie 2. Nature methods, 9(4), 357-359. [PMID: 22388286]
- bwa: Li H. and Durbin R. (2009) Fast and accurate short read alignment with Burrows-Wheeler Transform. Bioinformatics, 25:1754-60. [PMID: 19451168]
- DESeq2: Love MI, Huber W and Anders S (2014). “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.” Genome Biology, 15, pp. 550. [PMID: 25516281]
- EdgeR: McCarthy DJ, Chen Y and Smyth GK (2012). Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Research, 40(10), pp. 4288-4297. [PMID: 22287627]
- HTSeq: Anders, S., Pyl, P. T., & Huber, W. (2014). HTSeq–a Python framework to work with high-throughput sequencing data. Bioinformatics. [PMID: 25260700]
- HISAT2: Kim, D., Langmead, B., & Salzberg, S. L. (2015). HISAT: a fast spliced aligner with low memory requirements. Nature methods, 12(4), 357-360. [PMID: 25751142]
- BEDTools: Quinlan AR and Hall IM, 2010. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 26, 6, pp. 841–842. [PMID: 20110278]
- GAGE: Luo, Weijun, Michael S. Friedman, Kerby Shedden, Kurt D. Hankenson, and Peter J. Woolf. 2009. “GAGE: Generally Applicable Gene Set Enrichment for Pathway Analysis.” BMC Bioinformatics 10 (May): 161.
- Pathview: Luo, Weijun, and Cory Brouwer. 2013. “Pathview: An R/Bioconductor Package for Pathway-Based Data Integration and Visualization.” Bioinformatics 29 (14). Oxford University Press: 1830–31.
- Ballgown: Frazee, Alyssa C., Geo Pertea, Andrew E. Jaffe, Ben Langmead, Steven L. Salzberg, and Jeffrey T. Leek. 2015. “Ballgown Bridges the Gap between Transcriptome Assembly and Expression Analysis.” Nature Biotechnology 33 (3): 243–46.
- featureCounts: Liao, Yang, Gordon K. Smyth, and Wei Shi. 2014. “featureCounts: An Efficient General Purpose Program for Assigning Sequence Reads to Genomic Features.” Bioinformatics 30 (7): 923–30.
- StringTie: Pertea, Mihaela, Geo M. Pertea, Corina M. Antonescu, Tsung-Cheng Chang, Joshua T. Mendell, and Steven L. Salzberg. 2015. “StringTie Enables Improved Reconstruction of a Transcriptome from RNA-Seq Reads.” Nature Biotechnology 33 (3): 290–95.