pavlidislab / rnaseq-pipeline

RNA-seq pipeline for raw sequence alignment and transcript/gene quantification.

License: The Unlicense

rna-seq alignment quantification ngs ngs-analysis illumina rsem star stringtie fastqc


Pavlidis Lab RNA-seq pipeline repository


This documentation is principally written to support the Pavlidis Lab, and we are still updating it, but the pipeline should be fairly easy to configure on any Linux server by following these instructions. External users interested in using this pipeline for RNA-Seq quantification should contact our helpdesk if troubleshooting assistance is needed.

Features

  • a source mechanism that supports discovery of samples from GEO, SRA, ArrayExpress and Gemma (our in-house curation database)
  • built with STAR, RSEM, MultiQC, FastQC, and more
  • produces count and FPKM matrices suitable for analysis with R and Python
  • distributes work via a workload manager thanks to Bioluigi
  • notifies collaborators via the Slack API
  • submits experiments defined in a Google Spreadsheet
  • web viewer to preview QC reports and download quantifications
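
The count and FPKM matrices can be consumed directly from R or Python. A minimal sketch of loading one in Python, assuming a simple tab-separated layout with genes as rows and samples as columns (the sample content and layout here are illustrative, not the pipeline's exact output format):

```python
import csv
import io

# Hypothetical tab-separated counts matrix: genes as rows, samples as columns.
SAMPLE = "gene_id\tGSM001\tGSM002\nENSMUSG01\t10\t0\nENSMUSG02\t3\t7\n"

def load_counts(handle):
    """Parse a counts matrix into {gene_id: {sample_id: count}}."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    return {row[0]: {s: int(v) for s, v in zip(header[1:], row[1:])}
            for row in reader}

counts = load_counts(io.StringIO(SAMPLE))
print(counts["ENSMUSG01"]["GSM001"])  # 10
```

In practice one would open the matrix file produced under the pipeline output directory instead of the in-memory sample.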

Downloading and installing

Clone this repository:

git clone --recurse-submodules https://github.com/PavlidisLab/rnaseq-pipeline
cd rnaseq-pipeline

Note: We use a patched version of RSEM that honors the $TMPDIR environment variable for its intermediate outputs, fixes issues with moving files across filesystems, and uses STAR's shared memory feature by default.

Create and activate a Conda environment with all the required software dependencies:

conda env create -f environment.yml
conda activate rnaseq-pipeline

Build the shared memory cleanup tool:

make -C scripts

Note: We remove unused shared memory objects allocated by STAR in Slurm task prolog and epilog scripts.

Build RSEM:

make -C contrib/RSEM

Install the pipeline Python package in the Conda environment:

pip install . # use -e if you want to edit the pipeline

Copy the example.luigi.cfg file to luigi.cfg:

cp example.luigi.cfg luigi.cfg

It should work as-is, but you might want to change the output location and the resources.

Next, start the Luigi scheduler daemon. You can follow the progress of your tasks at http://localhost:8082/.

luigid

For convenience, we provide a luigi-wrapper script that sets the --module flag to rnaseq_pipeline.tasks for you.

luigi-wrapper <task> <task_args>

Setting up a genomic reference

The pipeline automatically generates the RSEM/STAR index; all that is required is to drop the GTF annotation file and the primary assembly FASTA file under the pipeline-output/genomes/<reference_id> subdirectory.

For example, you can set up a mouse reference from Ensembl by downloading the following files into pipeline-output/genomes/mm10_ensembl98:

  • ftp://ftp.ensembl.org/pub/release-98/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.primary_assembly.fa.gz
  • ftp://ftp.ensembl.org/pub/release-98/gtf/mus_musculus/Mus_musculus.GRCm38.98.gtf.gz
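
The destination of each reference file follows mechanically from its URL and the reference identifier. A hedged sketch of that mapping (the directory layout matches the tree described below; the helper name is ours, not part of the pipeline):

```python
import posixpath

def genome_file_destination(url, reference_id, output_dir="pipeline-output"):
    """Compute where a downloaded reference file should land, i.e.
    <output_dir>/genomes/<reference_id>/<basename of the URL>."""
    return posixpath.join(output_dir, "genomes", reference_id,
                          posixpath.basename(url))

print(genome_file_destination(
    "ftp://ftp.ensembl.org/pub/release-98/gtf/mus_musculus/Mus_musculus.GRCm38.98.gtf.gz",
    "mm10_ensembl98"))
# pipeline-output/genomes/mm10_ensembl98/Mus_musculus.GRCm38.98.gtf.gz
```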

Triggering tasks

The top-level task you will likely want to use is rnaseq_pipeline.tasks.GenerateReportForExperiment.

luigi-wrapper rnaseq_pipeline.tasks.GenerateReportForExperiment --source geo --taxon mouse --reference mm10_ensembl98 --experiment-id GSE80745

The output is organized as follows:

pipeline-output/
    genomes/<reference_id>/                 # genomic references
    references/<reference_id>/              # RSEM/STAR indexes
    data/<source>/                          # FASTQs (note that the GEO source uses SRA)
    data-qc/<experiment_id>/<sample_id>/    # FastQC reports
    aligned/<reference_id>/<experiment_id>/ # alignments and quantification results
    quantified/<reference_id>/              # quantification matrices for isoforms and genes
    report/<reference_id>/<experiment_id>/  # MultiQC reports for reads and alignments

You can adjust the pipeline output directory by setting OUTPUT_DIR under [rnaseq_pipeline] in the configuration.

[rnaseq_pipeline]
OUTPUT_DIR=/scratch/rnaseq-pipeline

Setting up distributed computation

The pipeline is built upon Bioluigi, which supports dispatching external programs to a workload manager such as Slurm.

[bioluigi]
scheduler=slurm
scheduler_extra_args=[]

Web viewer

The pipeline comes with a Web viewer that provides convenient endpoints for browsing QC reports.

When installing, add the webviewer extra, which includes Flask and gunicorn:

pip install .[webviewer]
gunicorn rnaseq_pipeline.viewer:app

Gemma integration

The RNA-Seq pipeline is capable of communicating with Gemma using its RESTful API.

External spreadsheet via Google Sheets API

The RNA-Seq pipeline can pull experiment IDs from a collaborative spreadsheet through the Google Sheets API. This feature requires extra dependencies that are supplied by the gsheet extra:

pip install .[gsheet]

This makes the rnaseq_pipeline.tasks.SubmitExperimentsFromGoogleSpreadsheetToGemma task available. We also provide a convenience script:

submit-experiments-from-gsheet --spreadsheet-id <spreadsheet_id> --sheet-name <sheet_name>

The remote spreadsheet must contain the following columns:

  • experiment_id, the Gemma experiment short name
  • priority, the Luigi task priority, an integer
  • data, the status of the data; allowed values are ok, resubmit (which forces a rerun) and needs attention; all other values are ignored

Only experiments with strictly positive priority are scheduled.
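
The scheduling rule above can be sketched as a pure function. This is an illustration, not the pipeline's actual code; column names follow the spreadsheet layout described above, and we assume here that "needs attention" rows are recognized but not scheduled:

```python
def experiments_to_schedule(rows):
    """Return (experiment_id, force_rerun) pairs for rows whose data status
    is usable and whose priority is strictly positive."""
    scheduled = []
    for row in rows:
        # "needs attention" and unknown statuses are assumed not schedulable.
        if row["data"] not in ("ok", "resubmit"):
            continue
        if int(row["priority"]) > 0:
            scheduled.append((row["experiment_id"], row["data"] == "resubmit"))
    return scheduled

rows = [
    {"experiment_id": "GSE1", "priority": "2", "data": "ok"},
    {"experiment_id": "GSE2", "priority": "0", "data": "ok"},
    {"experiment_id": "GSE3", "priority": "1", "data": "resubmit"},
    {"experiment_id": "GSE4", "priority": "5", "data": "needs attention"},
]
print(experiments_to_schedule(rows))  # [('GSE1', False), ('GSE3', True)]
```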

rnaseq-pipeline's People

Contributors

arteymix, mbelmadani, ppavlidis


rnaseq-pipeline's Issues

Use --origfmt with fastq-dump

Currently, we use the SRA format for FASTQ headers, which prefixes the SRR run accession to the original string from the sequencer. This format is not compatible with ArrayExpress and local sources, and will pose a problem if we try to generalize batch information extraction to arbitrary FASTQs rather than just GEO series.

The solution is to add the --origfmt flag to fastq-dump so that the original header is used instead.

This might require some adjustment in how Gemma parses the batch information.
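
To illustrate the difference between the two header formats, here is a small sketch that detects SRA-style headers (e.g. "@SRR1553607.1 1 length=75") versus an original sequencer read name; the example read names are made up:

```python
import re

# fastq-dump's default header starts with the run accession and read index,
# e.g. "@SRR1553607.1 1 length=75"; SRA run accessions begin with SRR, ERR or DRR.
# With --origfmt, the sequencer's original read name is preserved instead.
SRA_HEADER = re.compile(r"^@[DES]RR\d+\.\d+")

def is_sra_formatted(header):
    """Return True if a FASTQ header looks like fastq-dump's default format."""
    return bool(SRA_HEADER.match(header))

print(is_sra_formatted("@SRR1553607.1 1 length=75"))  # True
print(is_sra_formatted("@HWI-ST1276:71:C1162ACXX:1:1101:1208:2458 1:N:0:"))  # False
```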

FastQC fails silently

It causes the following error in Luigi:

Runtime error:
Traceback (most recent call last):
  File "/space/grp/Pipelines/rnaseq-pipeline/venv/lib/python2.7/site-packages/luigi/worker.py", line 184, in run
    raise RuntimeError('Unfulfilled %s at run time: %s' % (deps, ', '.join(missing)))
RuntimeError: Unfulfilled dependency at run time: QualityControlSample_GSE117223_GSM3288306_d6aef47967

Spring cleanup!

Most of the logic is now in scheduler/tasks.py, so it would be great to clean up the legacy code and tidy up this repository.

Add task for backfilling FASTQ metadata only

Some experiments may need to be re-downloaded for the sole purpose of retrieving FASTQ metadata. This could be implemented as a Luigi task.

Additional considerations:

  • Is it possible to speed this up by only downloading a small part of the file?
  • Are there external metadata repositories for GEO/SRA data? (e.g. MetaSRA, http://metasra.biostat.wisc.edu/api.html)
  • Is it better to overwrite existing (obsolete or missing) metadata for historical purposes or append to it?

Upload pipeline metadata to Gemma

There's been discussion about using a subset of CWL to produce a file with structured pipeline metadata.

This process can be entirely automated by introspecting the Luigi task graph and extending ExternalProgramTask to provide additional metadata such as a version number. Ideally, we would move that logic into Bioluigi and make all our tasks CWL-friendly.
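
As a rough illustration of the idea, one can walk a task dependency graph and collect tool versions along the way. This is plain Python, not actual Luigi introspection; the task names, versions, and dictionary keys are all made up:

```python
def collect_pipeline_metadata(task, seen=None):
    """Walk a task dependency graph, collecting a {name: version} mapping.
    Each 'task' is a dict with hypothetical 'name', 'version' and 'requires' keys."""
    if seen is None:
        seen = {}
    if task["name"] not in seen:
        seen[task["name"]] = task.get("version")
        for dep in task.get("requires", []):
            collect_pipeline_metadata(dep, seen)
    return seen

# Made-up two-task graph: quantification depends on alignment.
star = {"name": "AlignSample", "version": "STAR 2.7"}
rsem = {"name": "QuantifySample", "version": "RSEM 1.3", "requires": [star]}
print(collect_pipeline_metadata(rsem))
# {'QuantifySample': 'RSEM 1.3', 'AlignSample': 'STAR 2.7'}
```

A real implementation would derive this mapping from the Luigi task objects themselves and serialize it in a CWL-compatible form.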

Let sources handle taxons and other metadata

This would basically resolve #13 since we would pull the information from SRA/GEO/Gemma instead of requesting it.

For the local source, we would still need a flag to override a default taxon.

Remove --taxon argument

The information that the taxon provides is redundant because we already have the genome and annotation reference identifiers.

For that to work, we have to update Gemma platform identifiers to use explicit Ensembl versions.

Remove FASTQs when not needed if source is not local

We need the FASTQ files for two things: alignment tasks and batch information extraction.

At this time, submitting batch information and quantifying gene expression are two separate and independent tasks that share some common dependencies. This makes it a bit difficult to determine the right moment to clear the FASTQ files.

We could add a wrapper task that depends on any task needing the FASTQ files and that would be considered complete only if all its dependencies are met AND the FASTQ files no longer exist.

This logic cannot apply to the local source.
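
The proposed completion rule boils down to a simple predicate, sketched here as a pure function (the names are illustrative; a real implementation would override the completion check of a Luigi wrapper task):

```python
def cleanup_task_complete(dependencies_done, fastq_files_exist):
    """The wrapper is complete only once every consumer of the FASTQs has
    finished AND the FASTQ files themselves have been removed."""
    return all(dependencies_done) and not fastq_files_exist

print(cleanup_task_complete([True, True], fastq_files_exist=False))   # True
print(cleanup_task_complete([True, False], fastq_files_exist=False))  # False
print(cleanup_task_complete([True, True], fastq_files_exist=True))    # False
```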

Prepare/Load metadata into Gemma

Addressing request from PavlidisLab/Gemma#20

For GEO datasets/expression experiments, we would like to obtain:

  • The GSExxxxx
  • Pipeline parameters (Assembly, software versions, etc)
  • Fastq header information
  • Alignment information (may not be straightforward with RSEM)

To start, let's produce a text file in a "METADATA" directory similar to the Data directory.

Raise warning on download when GEO accession doesn't have matching SRA data

Experiments from GEO can sometimes have an accession without actually having SRA data. An example is GSE64018.

Currently the pipeline will not see this as a failure in DownloadGSE, but the QC step will fail (assuming the --nsamples argument of QcGSE is not 0). It would be nicer if DownloadGSE either returned a failure when it cannot find a matching SRA accession in the MINIML file, or at least raised a loud warning so the problem is obvious to the pipeline user.

ERCC Spike-ins support

It would be convenient to automate the generation of ERCC spike-in counts and store the results for QC purposes.

STAR does not always cleanup shared memory

This is basically due to a leak that occurs when a process exits without decrementing the shared memory usage counter. It would be better to base the cleanup decision on the number of processes attached to the segment instead.
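
On Linux, the number of processes attached to each System V shared memory segment is exposed in the nattch column of /proc/sysvipc/shm. A hedged sketch of finding orphaned segments that way (the column layout follows the Linux procfs format; the sample contents are made up):

```python
def orphaned_segments(proc_shm_text):
    """Return shmids of System V shared memory segments with no attached
    processes (nattch == 0), parsed from /proc/sysvipc/shm contents."""
    lines = proc_shm_text.strip().splitlines()
    header = lines[0].split()
    shmid_col, nattch_col = header.index("shmid"), header.index("nattch")
    return [int(f[shmid_col]) for f in (line.split() for line in lines[1:])
            if int(f[nattch_col]) == 0]

# Made-up /proc/sysvipc/shm snapshot: one attached segment, one orphan.
SAMPLE = """\
key shmid perms size cpid lpid nattch uid gid cuid cgid atime dtime ctime
0 32768 1600 4096 1234 1234 2 1000 1000 1000 1000 0 0 0
0 65537 1600 4096 1234 0 0 1000 1000 1000 1000 0 0 0
"""
print(orphaned_segments(SAMPLE))  # [65537]
```

A cleanup script along these lines could then remove the orphaned segments, e.g. via ipcrm, from the Slurm epilog.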

Move to Python 3

Luigi (and also the Python core developers...) is about to drop support for Python 2.7; see spotify/luigi#2876.

This is a rather simple move as we only have to update the Conda environment to use Python 3 and possibly adjust some code in the pipeline.

Minimize usage of network between DownloadSample and AlignSample

In the current setup, it is possible for the AlignSample task to get scheduled on a different node than the matching DownloadSample task, which disrupts NFS caching.

We could start by pinning align jobs to whichever node was used to download the data. If that works well, we could then switch to using the local scratch filesystem to avoid NFS entirely.

Look into including adapter trimming

While it is not generally needed, some datasets might require adapter trimming, and there are pretty popular solutions out there that automate some of these decisions, such as Trim Galore!.

What's nice about Trim Galore! is that we already use FastQC for reporting read quality and it might be possible to reuse the generated report.

Add a SRA source

In addition to the GEO source, which invokes the SRA source internally, it would be nice to complement it with an SRA source that works with SRX accessions (see #23).
