
lieberinstitute / speaqeasy


SPEAQeasy: portable LIBD RNA-seq pipeline using Nextflow. Check http://research.libd.org/SPEAQeasy-example/ for an example of how to use this pipeline and analyze the resulting output files.

Home Page: http://lieberinstitute.github.io/SPEAQeasy

License: MIT License

R 0.07% Perl 0.01% Shell 0.03% Makefile 0.01% Groovy 0.03% Python 0.01% HTML 99.82% Dockerfile 0.02% Lua 0.01% Nextflow 0.02%
rstats rnaseq nextflow pipeline hisat2 speaqeasy rna-seq-pipeline

speaqeasy's Introduction

SPEAQeasy: a Scalable Pipeline for Expression Analysis and Quantification that is easy to install and share

Summary

SPEAQeasy is a Scalable RNA-seq Pipeline for Expression Analysis and Quantification based on the RNAseq-pipeline. Built on Nextflow, capable of using Docker containers, and able to work with common resource managers (e.g. SLURM), this port of the RNAseq-pipeline can be run in a variety of computing environments. It is described in the SPEAQeasy manuscript (https://doi.org/10.1186/s12859-021-04142-3).

The main function of this pipeline is to produce files comparable to those used in recount2, a tool that provides gene, exon, exon-exon junction, and base-pair level data.

This pipeline allows researchers to contribute data to the recount2 project even from outside the JHPCE cluster.

Workflow overview

General Workflow

SPEAQeasy takes raw RNA-seq reads and produces analysis-ready R objects, providing a "bridge to the Bioconductor universe", where researchers can utilize the powerful existing set of tools to quickly perform desired analyses.

Beginning with a set of FASTQ files (optionally gzipped), SPEAQeasy ultimately produces RangedSummarizedExperiment objects to store gene, exon, and exon-exon junction counts for an experiment. Optionally, expressed regions data is generated, enabling easy computation of differentially expressed regions (DERs).

Our vignette demonstrates how genotype calls by SPEAQeasy can be coupled with user-provided genotype and phenotype data to easily resolve identity issues that arise during sequencing. We then walk through an example differential expression analysis and explore data visualization options.

Pipeline features

Getting started

The SPEAQeasy documentation website describes the pipeline in full detail. To get started quickly, check out the quick start guide.

Because SPEAQeasy is based on the Nextflow workflow manager, it supports execution on computing clusters managed by SLURM or SGE without any configuration (local execution is also possible). Those with access to Docker can simply use Docker containers to manage SPEAQeasy software dependencies, though we also provide a script for installing dependencies for users without Docker or even root privileges.
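For reference, executor and container choices are standard Nextflow configuration. The sketch below is illustrative only (it does not reproduce SPEAQeasy's actual conf/ files) and assumes a SLURM cluster with Docker available:

// Minimal illustrative Nextflow configuration, not SPEAQeasy's own config files
process {
    executor = 'slurm'    // or 'sge'; use 'local' for single-machine runs
}

docker {
    enabled = true        // run each process inside its assigned container
}

Users without Docker would instead point the pipeline at locally installed tools, which is what the dependency installation script is for.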

Authors

Original Pipeline

Emily Burke, Leonardo Collado-Torres, Andrew Jaffe, BaDoi Phan

Nextflow Port

Nick Eagles, Brianna Barry, Jacob Leonard, Israel Aguilar, Violeta Larios, Everardo Gutierrez

Cite SPEAQeasy

We hope that SPEAQeasy will be useful for your research. Please use the following bibtex information to cite the software and overall approach. Thank you!

@article {Eagles2021,
	author = {Eagles, Nicholas J. and Burke, Emily E. and Leonard, Jacob and Barry, Brianna K. and Stolz, Joshua M. and Huuki, Louise and Phan, BaDoi N. and Larios Serrato, Violeta and Guti{\'e}rrez-Mill{\'a}n, Everardo and Aguilar-Ordo{\~n}ez, Israel and Jaffe, Andrew E. and Collado-Torres, Leonardo},
	title = {SPEAQeasy: a scalable pipeline for expression analysis and quantification for R/bioconductor-powered RNA-seq analyses},
	year = {2021},
	doi = {10.1186/s12859-021-04142-3},
	publisher = {Springer Science and Business Media LLC},
	URL = {https://doi.org/10.1186/s12859-021-04142-3},
	journal = {BMC Bioinformatics}
}

Contact

speaqeasy's People

Contributors

andrewejaffe, biosls, evergmillan, gpertea, iaguilaror, lcolladotor, nick-eagles


speaqeasy's Issues

Check if "flow" can be run once when annotation files are absent

The current pipeline version needs to be run twice: first to create the annotation files (download them and build the indexes), and then to actually process the data.

For example:

https://github.com/LieberInstitute/RNAsp/blob/049679dd1dd2d2636ae54fc7c0b89c364b9a6998/run_test_jhpce.sh
and then
https://github.com/LieberInstitute/RNAsp/blob/049679dd1dd2d2636ae54fc7c0b89c364b9a6998/run_test_jhpce_resume.sh

So:

bash run_test_jhpce.sh
bash run_test_jhpce_resume.sh

Israel Aguilar mentioned that the storeDir directive (https://www.nextflow.io/docs/latest/process.html#storedir) could be used to resolve this issue in the workflow. Hopefully that's right.
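For illustration, a sketch of what that could look like (process, channel, and parameter names here are hypothetical, not taken from main.nf): storeDir keeps the built index in a permanent location, and Nextflow skips the process on later runs if those files already exist.

// Hypothetical annotation-building process using storeDir
process BuildHisatIndex {
    storeDir "${params.annotation_dir}/hisat2_index"   // permanent cache location

    input:
    file genome_fa from reference_fasta

    output:
    file "genome.*.ht2" into hisat_index

    """
    hisat2-build ${genome_fa} genome
    """
}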

Test pipeline at JHPCE with `conda_R/3.5.x` module

Related to #10, update the R scripts so that they work with conda_R/3.5.x (copy and rename the current conf/jhpce.conf to conf/jhpce_R3.4.conf or something like that).

Basically, the non-wg_test scripts are the original ones Emily and I wrote for R 3.3.x, which we won't support anymore; we should update them so they work with R 3.5.x. You can compare the R 3.3.x scripts against the wg_test ones (which should work with R 3.4.x, per #10).

May updates

From Winter Genomics:

Regarding the “Portable RNAseq-pipeline” project, we are glad to inform you that the release of the pipeline is ready for deployment on the JHPCE cluster used by the Lieber Institute for Brain Development (LIBD).

This release includes README documentation that explains pipeline setup and configurations required for test runs.

Attached we send you a Nextflow report from one of many successful test runs on our dev/test server. A successful run report on your side should look very similar.

Features included in this release of the pipeline:

  • Pipeline is a complete port of the original code provided by Lieber in https://github.com/LieberInstitute/RNAseq-pipeline.
  • Pipeline can operate on human, mouse, and rat data, in either single-end or paired-end form.
  • Pipeline can use container technology via Docker to increase portability.
  • Pipeline can run in a cluster under the SGE platform for computational resource management and scalability.

We will receive your run reports and observations and use them to make adjustments and fixes as part of our commitment to maintaining this pipeline.

Please feel free to contact us if you have any questions.

Thank you for your time.

Best regards,

Diana

Fix regtools docker

Hi,

The regtools GitHub repo now points to version 0.5.0 instead of 0.3.0, so the code for building the Docker image is outdated.

That is:
https://github.com/LieberInstitute/RNAsp/blob/master/dockerfiles/regtools-0.3.0.dockerfile#L14-L19

Here's part of the code that would work:

wget https://github.com/griffithlab/regtools/archive/0.3.0.tar.gz -O regtools-0.3.0.tar.gz && \
    tar -xvf regtools-0.3.0.tar.gz && \
    cd regtools-0.3.0 && \
    mkdir build && \
    cd build && \
    cmake .. && \
    make

That's what I just did for creating the environment module at JHPCE for regtools 0.3.0 LieberInstitute/jhpce_module_source@941c588

Best,
Leo

Possible pipeline crashing when R packages are not all installed

In most processes that run R scripts, the script "check_R_packages.R" is run beforehand to ensure required packages are installed. However, if some packages are not installed, the pipeline is likely (though not guaranteed) to crash. The issue is that install.packages() (whether called directly or through BiocManager::install()) locks the install directory during installation so that files are not tampered with by another user or process (https://www.rdocumentation.org/packages/utils/versions/3.5.1/topics/install.packages). Since Nextflow processes often execute concurrently, what regularly happens is:

  • The process that starts slightly earlier puts a lock on the directory in order to install the missing package

  • Before that process finishes, a process that starts slightly later sees that the package is not properly installed but fails to install it, because the directory is locked (this is in fact the intended function of such a lock)

An additional problem in this scenario is that a lock may remain on the directory after the pipeline crashes, which can cause confusing errors later on.
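One possible mitigation, sketched below under the assumption that the existing check_R_packages.R script can be called from a dedicated step (the scripts/ path and channel names are hypothetical): run the package check exactly once, before any parallel R steps, so install.packages() is never invoked concurrently.

// Hypothetical one-time package check; all R-based processes wait on its output
process CheckRPackages {
    output:
    file 'r_packages_ok' into r_packages_ready

    """
    Rscript ${baseDir}/scripts/check_R_packages.R && touch r_packages_ok
    """
}

Downstream R-based processes would then take r_packages_ready as an additional input, forcing them to wait until the single install/check step has completed.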

Check pipeline at JHPCE with R 3.4.x

Restructure use of infer_experiment.py

The pipeline should rely exclusively on user input to set strandedness for all commands. The infer_experiment.py / infer-strandedness step should act only as a sanity check; the pipeline code should use the strandedness argument the user passes on the command line when launching the pipeline (i.e. --strand). Certain kinds of BAM files cannot be interpreted by infer_experiment.py, resulting in the failure of downstream steps in the pipeline. The infer-strandedness step should, however, throw a warning if the predicted strandedness does not match the user input or if the step fails. See the commit Leo made to the old pipeline that throws such a warning.

Enable multi-threading for FastQC

FastQC allocates 250MB to each thread, so some samples will crash if only allowed one thread. Allow users to set more threads with the "-t" option, i.e. "fastqc -t 4" for four threads. This value could simply be "$task.cpus", as for Trimmomatic.
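A sketch of the proposed change (process and channel names are illustrative, not the pipeline's actual ones): the cpus directive supplies task.cpus, which is passed straight through to FastQC's -t option.

// Illustrative FastQC process with multi-threading driven by the cpus directive
process FastQC {
    cpus 4

    input:
    file fq from fastq_files_ch

    output:
    file "*_fastqc.*" into fastqc_reports

    """
    fastqc -t ${task.cpus} ${fq}
    """
}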

Output logs for each process along with results

The pipeline currently does not 'publish' logs for most processes to the results directory. These logs remain in the temporary work directories, which makes them impractical and confusing to access.
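The fix would be to add a publishDir directive to each process. The example below is purely illustrative (names and directory layout are not from main.nf); the pattern option limits publishing to the log files.

// Illustrative process that copies its log into the results directory
process ExampleStep {
    publishDir "${params.output}/logs/example_step", mode: 'copy', pattern: '*.log'

    input:
    file reads from reads_ch

    output:
    file 'example_step.log' into example_step_logs

    """
    echo "processing ${reads}" > example_step.log
    """
}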

Update Subread

Update Subread and thus featureCounts to the latest version (Release 2.0.0, 4 Sept 2019 - http://subread.sourceforge.net/) so that input BAM files can contain both unpaired and paired reads. Note also that Subread/featureCounts exists as a module on JHPCE so an updated module should be added.

Jobs submitted to queues other than 'shared' at JHPCE

Very occasionally, I'll see that a process or two gets submitted to a non-shared queue, even though no queue was specified (and submission should thus default to 'shared' at JHPCE). Some queue names I've seen are 'leek', 'beer', and 'hongkai' (and a few more I can't remember).

Remove WG test

Remove the WG test code and the variables related to it, since we don't need this anymore.

Update python + associated modules

The latest version of python on the cluster right now is 3.7.3. We should use this instead of 2.7.9, since support for python 2 is ending. This means we should also use the cluster module for RSeQC 3.0.1, so that infer_experiment.py and bam2wig.py will work with the new python version. I also rewrote the bed_to_juncs.py script so that it is compatible with python 3.7.3, and I've successfully tested it on python 3.8.0 as well. I'll send it over Slack.

RSeQC infer_experiment.py doesn't work on mixed read types

RSeQC infer_experiment.py will return "data type unknown" if the input BAM contains both paired and unpaired reads. This kind of BAM can result from trimming programs (including Trimmomatic, which is used in this pipeline). An alternative for inferring strandedness that works on these mixed BAM files is Qualimap. Check out Qualimap bamqc for an option similar to infer_experiment.py, and Qualimap rnaseq for an option that roughly combines some of the outputs of infer_experiment.py and featureCounts: http://qualimap.bioinfo.cipf.es/doc_html/command_line.html
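For illustration only, a Qualimap-based step might look roughly like the following (flags per the Qualimap docs linked above; process, channel, and parameter names are hypothetical):

// Hypothetical Qualimap rnaseq step as an alternative to infer_experiment.py
process QualimapRnaseq {
    input:
    set val(sample_id), file(bam) from sorted_bams

    output:
    file "${sample_id}_qualimap" into qualimap_reports

    """
    qualimap rnaseq -bam ${bam} -gtf ${params.gtf} -outdir ${sample_id}_qualimap
    """
}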

Fix Salmon 0.9.1 docker code

Hi,

If I run the Salmon 0.9.1 docker code https://github.com/LieberInstitute/RNAsp/blob/master/dockerfiles/salmon-0.9.1.dockerfile, I can see that the tarball already contains a bin folder with the salmon program in it. So there's no need to run cmake or make.

Looks like the following is enough:

wget https://github.com/COMBINE-lab/salmon/releases/download/v0.9.1/Salmon-0.9.1_linux_x86_64.tar.gz && \
    tar -xvzf Salmon-0.9.1_linux_x86_64.tar.gz
  • plus the existing code that copies the files to the bin dir.

Best,
Leo

Resume functionality not fully working

I've observed that Nextflow does not always resume where it left off. For the process PairedEndHISAT, a sample may complete execution with exit status 0; however, if the pipeline halts later and is resumed, PairedEndHISAT is re-submitted for the same sample. For large experiments this is extremely taxing on progress.

In general, it is unclear which processes are well-specified in the sense of resuming as expected. main.nf needs changes to ensure this functionality for all processes; the Nextflow documentation describes how this can be done.
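One candidate fix worth testing (a sketch, not a confirmed solution): on shared filesystems, file timestamps can change between runs and silently invalidate Nextflow's task cache, so switching to 'lenient' caching (which hashes only path and size) often restores -resume behavior for processes that already completed.

// Illustrative config change: use lenient cache hashing for all processes
process {
    cache = 'lenient'
}

With that in place, re-running with nextflow run main.nf -resume would then be expected to skip PairedEndHISAT for samples whose outputs are already present and unchanged.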

Check disk lock nextflow req issue

Currently, cloning this repo on the /dcl01/ajaffe disk leads to a disk lock issue. I was able to run this pipeline by cloning the repo at /users/lcollado/test_next/RNAsp. If you can get the error message again and follow up with Mark Miller, that'd be great. Israel Aguilar mentioned that this issue can potentially be resolved by a cluster admin.

Create R 3.5.x docker files

Ideally, we want everything to run on R 3.5.x. If we have docker files for R 3.5.x, then we won't need the wg_test flag anymore (mentioned in #10 and #11).

There are other docker related issues like #3 and #4. We'll leave this and those issues till later (I'll need to figure out the account details for updating the LIBD dockers that Jacobo created).

Create mixed test data

Use ~24 samples from BrainSeq Phase 2, with about 5 of them having issues.

Select the same subjects' DLPFC and HIPPO samples, including:

  • a few mismatched
  • a few low quality

Adapt R code for HISAT 2.0.5 or newer

Once we have the current R scripts working with R 3.5.x (see #11), we should then make sure that our parsing scripts for the HISAT2 output summary files (originally written by Emily) work with newer HISAT2 versions.

I already created JHPCE modules for 2 newer versions of HISAT2.

Running the pipeline outside RNAsp

Check if the pipeline can be run from other directories instead of inside the RNAsp repo.

I think that some current processes assume that you are running the pipeline from inside the RNAsp repo. There might be a way to specify the path to the RNAsp repo using Nextflow. If we can do so, then we could potentially write a JHPCE module for RNAsp and have the annotation files pre-built (for the latest versions of the software we'll use).
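If it turns out processes rely on relative paths, one fix is to anchor everything to Nextflow's built-in baseDir variable (the directory containing main.nf). The lines below are illustrative only, with parameter and directory names assumed rather than taken from the repo:

// Illustrative: anchor repo-relative paths to baseDir so the launch directory no longer matters
params.scripts_dir    = "${baseDir}/scripts"
params.annotation_dir = "${baseDir}/Annotation"

The pipeline could then be launched from any working directory with something like nextflow run /path/to/RNAsp/main.nf.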

Script for a massive module for non-docker users

Extract, from the docker files or from our own LIBD module READMEs (e.g. https://github.com/LieberInstitute/jhpce_mod_source/tree/master/bcftools/1.9), the code for installing all the tools (assuming you don't have root access), and build one big module using Lmod, such that a single .lua file like https://github.com/LieberInstitute/jhpce_module_config/blob/master/rseqc/2.6.5.lua adds all the required tools to the PATH, PYTHONPATH, etc. (well, maybe without R). That way most of the tools will be easy to install, rather than a user of a cluster that doesn't support Docker having to install a bunch of them separately. Peers of that user can then use the massive module the user made for running RNAsp on their cluster (after changing the read/execute permissions of the installed tools).

Update annotations and make them match

In the annotation directory, as well as in main.nf, the latest annotations are not used (for mouse or human, and there may be newer releases for rat). We should probably update to the latest annotations. Additionally, some parts of the pipeline use the primary assembly while others use the whole assembly, which results in warnings about reads mapping to regions not present in the annotation. I'd think we should choose either the primary or the whole assembly and keep it the same throughout? This applies to all species.

Make README informative and concise

Most notably, information about Docker setup is scattered throughout the README and often redundant, and the container versions are outdated. Other adjustments will be made in response to feedback from new users of the pipeline.

create_fullCov_object.R doesn't check which chromosomes are present in input files, resulting in crashes

More specifically, lines 45-50 assume which chromosomes are present based on the organism (all chromosomes are assumed present, which fails for samples lacking a Y chromosome, for instance). The commented-out lines 59-60 document this issue. The pipeline technically runs when the WG-compatible version of the script is used, but that version only uses, and assumes the presence of, chromosome 1 (essentially a test case).

values not being used in biomart step

https://github.com/LieberInstitute/RNAsp/blob/731302e9b55914b3c4db2b2ed31e864d51063568/scripts/create_count_objects-human.R#L451
For example, in the linked section of code, the call to "getBM" specifies "values" but not "filters" (see https://www.rdocumentation.org/packages/biomaRt/versions/2.28.0/topics/getBM); in my testing, this results in no "filtering" step being applied. Fixing this could help with (though probably not fully solve) the previously discussed issue of multi-mapping IDs. This is relevant for the other species' create_count_objects scripts as well.
Something like the below would work for hg38:
sym <- getBM(attributes = c("ensembl_gene_id","hgnc_symbol","entrezgene_id"), filters="ensembl_gene_id", values=geneMap$ensemblID, mart=ensembl)

Pipeline should need only a manifest as input

As it is, the user specifies the --input directory, which is required to contain all the FASTQ files plus the samples.manifest file. Since samples.manifest contains everything required to locate and process the input files, it should be the only thing the user provides to the pipeline explicitly as input. FASTQ files would then be free to live in any directory and would not need to be moved to a single location.
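A sketch of what manifest-driven input could look like, assuming (hypothetically) that each samples.manifest line has a FASTQ path in its first field and a sample ID in its last field; since the manifest stores full paths, the FASTQ files can live anywhere:

// Illustrative manifest parsing; column layout is an assumption, adjust to the real spec
params.manifest = 'samples.manifest'

Channel
    .fromPath(params.manifest)
    .splitText()
    .map { line ->
        def fields = line.trim().split('\t')
        [fields[-1], file(fields[0])]   // [sample ID, path to first FASTQ]
    }
    .set { raw_reads }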

Check exon/transcript annotation info

Check with Emily that the R code in this pipeline is using the latest exon/transcript annotation creation code and info that we are using with the previous pipeline (Emily made changes to that code recently).

Change settings for Trimmomatic

Based on modern sequencing data types and the downstream processing (including what's implemented here), the default adapters for Trimmomatic ILLUMINACLIP should be TruSeq3-PE-2.fa for paired-end data and TruSeq3-SE.fa for single-end data. Note that single-end data cannot be used to detect palindromes. As such, the ILLUMINACLIP defaults should read as follows:

Single end: TruSeq3-SE.fa:2:30:10
Paired end: TruSeq3-PE-2.fa:2:30:10:1:TRUE

The terminal "true" for the paired end command means "keep both reads." See page 7 here for details: http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/TrimmomaticManual_V0.32.pdf. This argument prevents future errors with featureCounts and RSeQC.

The pipeline should also allow users to change these ILLUMINACLIP defaults and to supply their own adapter files.
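For illustration, the defaults and an override hook could be wired up roughly as follows (parameter, process, and channel names here are hypothetical, and a trimmomatic wrapper is assumed to be on the PATH):

// Hypothetical adapter parameters that users could override from the command line
params.adapter_fasta_paired = 'TruSeq3-PE-2.fa'
params.adapter_fasta_single = 'TruSeq3-SE.fa'

process TrimPairedReads {
    input:
    set val(id), file(fq1), file(fq2) from paired_reads

    output:
    set val(id), file("${id}_trimmed_*P.fastq") into trimmed_paired_reads

    """
    trimmomatic PE -threads ${task.cpus} ${fq1} ${fq2} \\
        ${id}_trimmed_1P.fastq ${id}_trimmed_1U.fastq \\
        ${id}_trimmed_2P.fastq ${id}_trimmed_2U.fastq \\
        ILLUMINACLIP:${params.adapter_fasta_paired}:2:30:10:1:TRUE
    """
}

Users could then pass something like --adapter_fasta_paired /path/to/custom_adapters.fa to swap in their own adapter file.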

Check JHPCE config file `clusterOptions`

Is the current JHPCE config file well specified? https://github.com/LieberInstitute/RNAsp/blob/049679dd1dd2d2636ae54fc7c0b89c364b9a6998/conf/jhpce.config

I'm not 100% sure that we are missing h_fsize, as stated in commit message e56c0a2. Similarly, we might have to set the mem_free option too. I have higher default values for h_fsize, which might be why I didn't see this problem. In any case, it'd be great to make sure that the config file is well specified.
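As a concrete but purely illustrative example of making those requests explicit (the values below are placeholders, not vetted JHPCE defaults):

// Illustrative SGE settings; clusterOptions is a standard Nextflow directive
process {
    executor       = 'sge'
    clusterOptions = '-l mem_free=5G,h_vmem=6G,h_fsize=100G'
}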
