
lieberinstitute / speaqeasy


SPEAQeasy: portable LIBD RNA-seq pipeline using Nextflow. Check http://research.libd.org/SPEAQeasy-example/ for an example of how to use this pipeline and analyze the resulting output files.

Home Page: http://lieberinstitute.github.io/SPEAQeasy

License: MIT License

R 0.07% Perl 0.01% Shell 0.03% Makefile 0.01% Groovy 0.03% Python 0.01% HTML 99.82% Dockerfile 0.02% Lua 0.01% Nextflow 0.02%
rstats rnaseq nextflow pipeline hisat2 speaqeasy rna-seq-pipeline

speaqeasy's Introduction

SPEAQeasy: a Scalable Pipeline for Expression Analysis and Quantification that is easy to install and share

Summary

SPEAQeasy is a Scalable RNA-seq Pipeline for Expression Analysis and Quantification based on the RNAseq-pipeline. Built on Nextflow, capable of using Docker containers, and able to work with common resource managers (e.g. SLURM), this port of the RNAseq-pipeline can be run in a variety of computing environments. It is described in the SPEAQeasy manuscript (https://doi.org/10.1186/s12859-021-04142-3).

The main function of this pipeline is to produce files comparable to those used in recount2, a tool that provides gene, exon, exon-exon junction, and base-pair level data.

This pipeline allows researchers to contribute data to the recount2 project even from outside the JHPCE cluster.

Workflow overview

General Workflow

SPEAQeasy takes raw RNA-seq reads and produces analysis-ready R objects, providing a "bridge to the Bioconductor universe", where researchers can utilize the powerful existing set of tools to quickly perform desired analyses.

Beginning with a set of FASTQ files (optionally gzipped), SPEAQeasy ultimately produces RangedSummarizedExperiment objects to store gene, exon, and exon-exon junction counts for an experiment. Optionally, expressed regions data is generated, enabling easy computation of differentially expressed regions (DERs).

Our vignette demonstrates how genotype calls by SPEAQeasy can be coupled with user-provided genotype and phenotype data to easily resolve identity issues that arise during sequencing. We then walk through an example differential expression analysis and explore data visualization options.

Pipeline features

Getting started

The SPEAQeasy documentation website describes the pipeline in full detail. To get started quickly, check out the quick start guide.

Because SPEAQeasy is based on the Nextflow workflow manager, it supports execution on computing clusters managed by SLURM or SGE without any configuration (local execution is also possible). Those with access to Docker can simply use Docker containers to manage SPEAQeasy software dependencies, though we also provide a script for installing dependencies for users without Docker or even root privileges.
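For reference, executor and container choices are standard Nextflow configuration. The sketch below is illustrative only (it does not reproduce SPEAQeasy's actual conf/ files) and assumes a SLURM cluster with Docker available:

// Minimal illustrative Nextflow configuration, not SPEAQeasy's own config files
process {
    executor = 'slurm'    // or 'sge'; use 'local' for single-machine runs
}

docker {
    enabled = true        // run each process inside its assigned container
}

Users without Docker would instead point the pipeline at locally installed tools, which is what the dependency installation script is for.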

Authors

Original Pipeline

Emily Burke, Leonardo Collado-Torres, Andrew Jaffe, BaDoi Phan

Nextflow Port

Nick Eagles, Brianna Barry, Jacob Leonard, Israel Aguilar, Violeta Larios, Everardo Gutierrez

Cite SPEAQeasy

We hope that SPEAQeasy will be useful for your research. Please use the following bibtex information to cite the software and overall approach. Thank you!

@article {Eagles2021,
	author = {Eagles, Nicholas J. and Burke, Emily E. and Leonard, Jacob and Barry, Brianna K. and Stolz, Joshua M. and Huuki, Louise and Phan, BaDoi N. and Larios Serrato, Violeta and Guti{\'e}rrez-Mill{\'a}n, Everardo and Aguilar-Ordo{\~n}ez, Israel and Jaffe, Andrew E. and Collado-Torres, Leonardo},
	title = {SPEAQeasy: a scalable pipeline for expression analysis and quantification for R/bioconductor-powered RNA-seq analyses},
	year = {2021},
	doi = {10.1186/s12859-021-04142-3},
	publisher = {Springer Science and Business Media LLC},
	URL = {https://doi.org/10.1186/s12859-021-04142-3},
	journal = {BMC Bioinformatics}
}

Contact

speaqeasy's People

Contributors

andrewejaffe, biosls, evergmillan, gpertea, iaguilaror, lcolladotor, nick-eagles


speaqeasy's Issues

Check if "flow" can be run once when annotation files are absent

The current pipeline version needs to be run twice: first to create the annotation files (download them and build the indexes), and then to actually process the data.

For example:

https://github.com/LieberInstitute/RNAsp/blob/049679dd1dd2d2636ae54fc7c0b89c364b9a6998/run_test_jhpce.sh
and then
https://github.com/LieberInstitute/RNAsp/blob/049679dd1dd2d2636ae54fc7c0b89c364b9a6998/run_test_jhpce_resume.sh

So:

bash run_test_jhpce.sh
bash run_test_jhpce_resume.sh

Israel Aguilar mentioned that the storeDir directive (https://www.nextflow.io/docs/latest/process.html#storedir) could be used to resolve this issue in the workflow. Hopefully that's right.
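For illustration, a sketch of what that could look like (process, channel, and parameter names here are hypothetical, not taken from main.nf): storeDir keeps the built index in a permanent location, and Nextflow skips the process on later runs if those files already exist.

// Hypothetical annotation-building process using storeDir
process BuildHisatIndex {
    storeDir "${params.annotation_dir}/hisat2_index"   // permanent cache location

    input:
    file genome_fa from reference_fasta

    output:
    file "genome.*.ht2" into hisat_index

    """
    hisat2-build ${genome_fa} genome
    """
}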

Test pipeline at JHPCE with `conda_R/3.5.x` module

Related to #10, update the R scripts so that they work with conda_R/3.5.x (copy and rename the current conf/jhpce.conf to conf/jhpce_R3.4.conf or something like that).

Basically, the non-wg_test scripts are the original ones Emily and I wrote for R 3.3.x, which we won't support anymore; we should update them so they work with R 3.5.x. You can compare the R 3.3.x scripts against the wg_test ones (which should work with R 3.4.x, per #10).

May updates

From Winter Genomics:

Regarding the “Portable RNAseq-pipeline” project, we are glad to inform you that the release of the pipeline is ready for deployment on the JHPCE cluster used by the Lieber Institute for Brain Development (LIBD).

This release includes README documentation that explains pipeline setup and configurations required for test runs.

Attached we send you a Nextflow report from one of many successful test runs on our dev/test server. A successful run report on your side should look very similar.

Features included in this release of the pipeline:

  • Pipeline is a complete port of the original code provided by Lieber in https://github.com/LieberInstitute/RNAseq-pipeline.
  • Pipeline can operate on human, mouse, and rat data, in either single-end or paired-end form.
  • Pipeline can use container technology via Docker to increase portability.
  • Pipeline can run in a cluster under the SGE platform for computational resource management and scalability.

We will receive your run reports and observations and use them to make adjustments and fixes as part of our commitment to maintaining this pipeline.

Please feel free to contact us if you have any questions.

Thank you for your time.

Best regards,

Diana

Fix regtools docker

Hi,

The regtools GitHub repo now points to version 0.5.0 instead of 0.3.0, so the code for building the Docker image is outdated.

That is:
https://github.com/LieberInstitute/RNAsp/blob/master/dockerfiles/regtools-0.3.0.dockerfile#L14-L19

Here's part of the code that would work:

wget https://github.com/griffithlab/regtools/archive/0.3.0.tar.gz -O regtools-0.3.0.tar.gz && \
    tar -xvf regtools-0.3.0.tar.gz && \
    cd regtools-0.3.0 && \
    mkdir build && \
    cd build && \
    cmake .. && \
    make

That's what I just did for creating the environment module at JHPCE for regtools 0.3.0 LieberInstitute/jhpce_module_source@941c588

Best,
Leo

Possible pipeline crashing when R packages are not all installed

In most processes that run R scripts, the script "check_R_packages.R" is run beforehand to ensure required packages are installed. However, if some packages are not installed, the pipeline is likely (though not guaranteed) to crash. The issue is that install.packages() (whether called directly or through BiocManager::install()) locks the install directory during installation so that files are not tampered with by another user or process (https://www.rdocumentation.org/packages/utils/versions/3.5.1/topics/install.packages). Since Nextflow processes often execute concurrently, what regularly happens is:

  • The process that starts slightly earlier puts a lock on the directory in order to install the missing package

  • Before that process finishes, a process that starts slightly later sees that the package is not properly installed but fails to install it, because the directory is locked (this is in fact the intended function of such a lock)

An additional problem in this scenario is that a lock may remain on the directory after the pipeline crashes, which can cause confusing errors later on.
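One possible mitigation, sketched below under the assumption that the existing check_R_packages.R script can be called from a dedicated step (the scripts/ path and channel names are hypothetical): run the package check exactly once, before any parallel R steps, so install.packages() is never invoked concurrently.

// Hypothetical one-time package check; all R-based processes wait on its output
process CheckRPackages {
    output:
    file 'r_packages_ok' into r_packages_ready

    """
    Rscript ${baseDir}/scripts/check_R_packages.R && touch r_packages_ok
    """
}

Downstream R-based processes would then take r_packages_ready as an additional input, forcing them to wait until the single install/check step has completed.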

Check pipeline at JHPCE with R 3.4.x

Restructure use of infer_experiment.py

The pipeline should rely exclusively on user input to set strandedness for all commands. The infer_experiment.py / infer-strandedness step should act only as a sanity check; the pipeline code should use the strandedness argument the user passes on the command line when launching the pipeline (i.e. --strand). Certain kinds of BAM files cannot be interpreted by infer_experiment.py, resulting in the failure of downstream steps in the pipeline. The infer-strandedness step should, however, throw a warning if the predicted strandedness does not match the user input or if the step fails. See the commit Leo made to the old pipeline that throws such a warning.

Enable multi-threading for FastQC

FastQC allocates 250MB to each thread, so some samples will crash if only allowed one thread. Allow users to set more threads with the "-t" option, i.e. "fastqc -t 4" for four threads. This value could simply be "$task.cpus", as for Trimmomatic.
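A sketch of the proposed change (process and channel names are illustrative, not the pipeline's actual ones): the cpus directive supplies task.cpus, which is passed straight through to FastQC's -t option.

// Illustrative FastQC process with multi-threading driven by the cpus directive
process FastQC {
    cpus 4

    input:
    file fq from fastq_files_ch

    output:
    file "*_fastqc.*" into fastqc_reports

    """
    fastqc -t ${task.cpus} ${fq}
    """
}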

Output logs for each process along with results

The pipeline currently does not 'publish' logs for most processes to the results directory. These logs remain in the temporary work directories, which makes them impractical and confusing to access.
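The fix would be to add a publishDir directive to each process. The example below is purely illustrative (names and directory layout are not from main.nf); the pattern option limits publishing to the log files.

// Illustrative process that copies its log into the results directory
process ExampleStep {
    publishDir "${params.output}/logs/example_step", mode: 'copy', pattern: '*.log'

    input:
    file reads from reads_ch

    output:
    file 'example_step.log' into example_step_logs

    """
    echo "processing ${reads}" > example_step.log
    """
}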

Update Subread

Update Subread and thus featureCounts to the latest version (Release 2.0.0, 4 Sept 2019 - http://subread.sourceforge.net/) so that input BAM files can contain both unpaired and paired reads. Note also that Subread/featureCounts exists as a module on JHPCE so an updated module should be added.

Jobs submitted to queues other than 'shared' at JHPCE

Very occasionally, I'll see that a process or two gets submitted to a non-shared queue, even though no queue was specified (and submission should thus default to 'shared' at JHPCE). Some queue names I've seen are 'leek', 'beer', and 'hongkai' (and a few more I can't remember).

Remove WG test

Remove the WG test code and the variables related to it, since we don't need this anymore.

Update python + associated modules

The latest version of python on the cluster right now is 3.7.3. We should use this instead of 2.7.9, since support for python 2 is ending. This means we should also use the cluster module for RSeQC 3.0.1, so that infer_experiment.py and bam2wig.py will work with the new python version. I also rewrote the bed_to_juncs.py script so that it is compatible with python 3.7.3, and I've successfully tested it on python 3.8.0 as well. I'll send it over Slack.

RSeQC infer_experiment.py doesn't work on mixed read types

RSeQC infer_experiment.py will return "data type unknown" if the input BAM contains both paired and unpaired reads. This kind of BAM can result from trimming programs (including Trimmomatic, which is used in this pipeline). An alternative for inferring strandedness that works on these mixed BAM files is Qualimap. Check out Qualimap bamqc for an option similar to infer_experiment.py, and Qualimap rnaseq for an option that roughly combines some of the outputs of infer_experiment.py and featureCounts: http://qualimap.bioinfo.cipf.es/doc_html/command_line.html
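For illustration only, a Qualimap-based step might look roughly like the following (flags per the Qualimap docs linked above; process, channel, and parameter names are hypothetical):

// Hypothetical Qualimap rnaseq step as an alternative to infer_experiment.py
process QualimapRnaseq {
    input:
    set val(sample_id), file(bam) from sorted_bams

    output:
    file "${sample_id}_qualimap" into qualimap_reports

    """
    qualimap rnaseq -bam ${bam} -gtf ${params.gtf} -outdir ${sample_id}_qualimap
    """
}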

Fix Salmon 0.9.1 docker code

Hi,

If I run the Salmon 0.9.1 docker code https://github.com/LieberInstitute/RNAsp/blob/master/dockerfiles/salmon-0.9.1.dockerfile, I can see that the tarball already contains a bin folder with the salmon program in it. So there's no need to run cmake or make.

Looks like the following is enough:

wget https://github.com/COMBINE-lab/salmon/releases/download/v0.9.1/Salmon-0.9.1_linux_x86_64.tar.gz && \
    tar -xvzf Salmon-0.9.1_linux_x86_64.tar.gz
  • plus the existing code that copies the files to the bin dir.

Best,
Leo

Resume functionality not fully working

I've observed that Nextflow does not always resume where it left off. For the process PairedEndHISAT, a sample may complete execution with exit status 0; however, if the pipeline halts later and is resumed, PairedEndHISAT is re-submitted for the same sample. For large experiments this is extremely taxing on progress.

In general, it is unclear which processes are well-specified in the sense of resuming as expected. main.nf needs changes to ensure this functionality for all processes; the Nextflow documentation describes how this can be done.
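One candidate fix worth testing (a sketch, not a confirmed solution): on shared filesystems, file timestamps can change between runs and silently invalidate Nextflow's task cache, so switching to 'lenient' caching (which hashes only path and size) often restores -resume behavior for processes that already completed.

// Illustrative config change: use lenient cache hashing for all processes
process {
    cache = 'lenient'
}

With that in place, re-running with nextflow run main.nf -resume would then be expected to skip PairedEndHISAT for samples whose outputs are already present and unchanged.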

Check disk lock nextflow req issue

Currently, cloning this repo on the /dcl01/ajaffe disk leads to a disk lock issue. I was able to run this pipeline by cloning the repo at /users/lcollado/test_next/RNAsp. If you can get the error message again and follow up with Mark Miller, that'd be great. Israel Aguilar mentioned that this issue can potentially be resolved by a cluster admin.

Create R 3.5.x docker files

Ideally, we want everything to run on R 3.5.x. If we have docker files for R 3.5.x, then we won't need the wg_test flag anymore (mentioned in #10 and #11).

There are other docker related issues like #3 and #4. We'll leave this and those issues till later (I'll need to figure out the account details for updating the LIBD dockers that Jacobo created).

Create mixed test data

Use ~24 samples from BrainSeq Phase 2, with about 5 of them having issues.

Select the same subjects' DLPFC and HIPPO samples, including:

  • a few mismatched
  • a few low quality

Adapt R code for HISAT 2.0.5 or newer

Once we have the current R scripts working with R 3.5.x (see #11), we should then make sure that our parsing scripts for the HISAT2 output summary files (originally written by Emily) work with newer HISAT2 versions.

I already created JHPCE modules for 2 newer versions of HISAT2.

Running the pipeline outside RNAsp

Check if the pipeline can be run from other directories instead of inside the RNAsp repo.

I think that some current processes assume that you are running the pipeline from inside the RNAsp repo. There might be a way to specify the path to the RNAsp repo using Nextflow. If we can do so, then we could potentially write a JHPCE module for RNAsp and have the annotation files pre-built (for the latest versions of the software we'll use).
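If it turns out processes rely on relative paths, one fix is to anchor everything to Nextflow's built-in baseDir variable (the directory containing main.nf). The lines below are illustrative only, with parameter and directory names assumed rather than taken from the repo:

// Illustrative: anchor repo-relative paths to baseDir so the launch directory no longer matters
params.scripts_dir    = "${baseDir}/scripts"
params.annotation_dir = "${baseDir}/Annotation"

The pipeline could then be launched from any working directory with something like nextflow run /path/to/RNAsp/main.nf.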

Script for a massive module for non-docker users

Extract, from the docker files or from our own LIBD module READMEs (e.g. https://github.com/LieberInstitute/jhpce_mod_source/tree/master/bcftools/1.9), the code for installing all the tools (assuming you don't have root access), and build one big module using Lmod, such that a single .lua file like https://github.com/LieberInstitute/jhpce_module_config/blob/master/rseqc/2.6.5.lua adds all the required tools to the PATH, PYTHONPATH, etc. (well, maybe without R). That way most of the tools will be easy to install, rather than a user of a cluster that doesn't support Docker having to install a bunch of them separately. Peers of that user can then use the massive module the user made for running RNAsp on their cluster (after changing the read/execute permissions of the installed tools).

Update annotations and make them match

In the annotation directory, as well as in main.nf, the latest annotations are not used (for mouse or human, and there may be newer releases for rat). We should probably update to the latest annotations. Additionally, some parts of the pipeline use the primary assembly while others use the whole assembly, which results in warnings about reads mapping to regions not present in the annotation. I'd think we should choose either the primary or the whole assembly and keep it the same throughout? This applies to all species.

Make README informative and concise

Most notably, information about Docker setup is scattered throughout the README and often redundant, and the container versions are outdated. Other adjustments will be made in response to feedback from new users of the pipeline.

create_fullCov_object.R doesn't check which chromosomes are present in input files, resulting in crashes

More specifically, lines 45-50 assume which chromosomes are present based on the organism (all chromosomes are assumed present, which fails for samples lacking a Y chromosome, for instance). The commented-out lines 59-60 document this issue. The pipeline technically runs when the WG-compatible version of the script is used, but that version only uses, and assumes the presence of, chromosome 1 (essentially a test case).

values not being used in biomart step

https://github.com/LieberInstitute/RNAsp/blob/731302e9b55914b3c4db2b2ed31e864d51063568/scripts/create_count_objects-human.R#L451
For example, in the linked section of code, the call to "getBM" specifies "values" but not "filters" (see https://www.rdocumentation.org/packages/biomaRt/versions/2.28.0/topics/getBM); in my testing, this results in no "filtering" step being applied. Fixing this could help with (though probably not fully solve) the previously discussed issue of multi-mapping IDs. This is relevant for the other species' create_count_objects scripts as well.
Something like the below would work for hg38:
sym <- getBM(attributes = c("ensembl_gene_id","hgnc_symbol","entrezgene_id"), filters="ensembl_gene_id", values=geneMap$ensemblID, mart=ensembl)

Pipeline should need only a manifest as input

As it is, the user specifies the --input directory, which is required to contain all the FASTQ files plus the samples.manifest file. Since samples.manifest contains everything required to locate and process the input files, it should be the only thing the user provides to the pipeline explicitly as input. FASTQ files would then be free to live in any directory and would not need to be moved to a single location.
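A sketch of what manifest-driven input could look like, assuming (hypothetically) that each samples.manifest line has a FASTQ path in its first field and a sample ID in its last field; since the manifest stores full paths, the FASTQ files can live anywhere:

// Illustrative manifest parsing; column layout is an assumption, adjust to the real spec
params.manifest = 'samples.manifest'

Channel
    .fromPath(params.manifest)
    .splitText()
    .map { line ->
        def fields = line.trim().split('\t')
        [fields[-1], file(fields[0])]   // [sample ID, path to first FASTQ]
    }
    .set { raw_reads }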

Check exon/transcript annotation info

Check with Emily that the R code in this pipeline is using the latest exon/transcript annotation creation code and info that we are using with the previous pipeline (Emily made changes to that code recently).

Change settings for Trimmomatic

Based on modern sequencing data types and the downstream processing (including what's implemented here), the default adapters for Trimmomatic ILLUMINACLIP should be TruSeq3-PE-2.fa for paired-end data and TruSeq3-SE.fa for single-end data. Note that single-end data cannot be used to detect palindromes. As such, the ILLUMINACLIP defaults should read as follows:

Single end: TruSeq3-SE.fa:2:30:10
Paired end: TruSeq3-PE-2.fa:2:30:10:1:TRUE

The terminal "true" for the paired end command means "keep both reads." See page 7 here for details: http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/TrimmomaticManual_V0.32.pdf. This argument prevents future errors with featureCounts and RSeQC.

The pipeline should also allow users to change these ILLUMINACLIP defaults and to supply their own adapter files.
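For illustration, the defaults and an override hook could be wired up roughly as follows (parameter, process, and channel names here are hypothetical, and a trimmomatic wrapper is assumed to be on the PATH):

// Hypothetical adapter parameters that users could override from the command line
params.adapter_fasta_paired = 'TruSeq3-PE-2.fa'
params.adapter_fasta_single = 'TruSeq3-SE.fa'

process TrimPairedReads {
    input:
    set val(id), file(fq1), file(fq2) from paired_reads

    output:
    set val(id), file("${id}_trimmed_*P.fastq") into trimmed_paired_reads

    """
    trimmomatic PE -threads ${task.cpus} ${fq1} ${fq2} \\
        ${id}_trimmed_1P.fastq ${id}_trimmed_1U.fastq \\
        ${id}_trimmed_2P.fastq ${id}_trimmed_2U.fastq \\
        ILLUMINACLIP:${params.adapter_fasta_paired}:2:30:10:1:TRUE
    """
}

Users could then pass something like --adapter_fasta_paired /path/to/custom_adapters.fa to swap in their own adapter file.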

Check JHPCE config file `clusterOptions`

Is the current JHPCE config file well specified? https://github.com/LieberInstitute/RNAsp/blob/049679dd1dd2d2636ae54fc7c0b89c364b9a6998/conf/jhpce.config

I'm not 100% sure that we are missing h_fsize, as stated in commit message e56c0a2. Similarly, we might have to set the mem_free option too. I have higher default values for h_fsize, which might be why I didn't see this problem. In any case, it'd be great to make sure that the config file is well specified.
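As a concrete but purely illustrative example of making those requests explicit (the values below are placeholders, not vetted JHPCE defaults):

// Illustrative SGE settings; clusterOptions is a standard Nextflow directive
process {
    executor       = 'sge'
    clusterOptions = '-l mem_free=5G,h_vmem=6G,h_fsize=100G'
}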
