hoelzer-lab / rnaflow Goto Github PK

A simple RNA-Seq differential gene expression pipeline using Nextflow

License: GNU General Public License v3.0

Nextflow 3.27% R 1.20% Ruby 0.41% HTML 95.12%

rnaflow's Introduction

RNAflow - An effective and simple RNA-Seq differential gene expression pipeline using Nextflow

Figure 1. Workflow. The user can decide after preprocessing to run a differential gene expression (DEG) analysis or a transcriptome assembly. Circles symbolize input data and download icons symbolize automated download of resources. Steps marked by asterisks are currently only available for some species. See here for a list of references for the used tools and please consider to cite them as well.

Table of Contents

Quick installation
Quick start
Usage
- Input files
- Resume your run
Workflow control
Profiles/configuration options
- Executor options...
- Engine options...
Monitoring
Output
- DESeq2 results
Working offline
Help message
Known bugs and issues
- [Problems with SortMeRNA/ HISAT2 error (#141, #116)](#problems-with-sortmerna-hisat2-error-141-116)
  - Description
  - Workaround
- Latency problems on HPCs, issue (#79)
  - Description
  - Workaround
Citation

Quick installation

The pipeline is written in Nextflow, which can be used on any POSIX compatible system (Linux, OS X, etc). Windows system is supported through WSL. You need Nextflow installed and either conda, Docker, or Singularity to run the steps of the pipeline:

Install Nextflow

click here for a bash one-liner

wget -qO- https://get.nextflow.io | bash
# In the case you don’t have wget
# curl -s https://get.nextflow.io | bash

Install conda

click here for a bash two-liner for Miniconda3 Linux 64-bit

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

Install conda

click here for a bash two-liner for Miniconda3 Linux 64-bit

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

Install Nextflow via conda

click here to see how to do that

conda create -n nextflow -c bioconda nextflow
conda active nextflow

For transcriptome assembly you have to install also Docker or Singularity.

You can try to simply install Singularity via conda as well

click for an example command

conda create -n singularity -c conda-forge singularity
conda active singularity

or if you already have a conda environment for nextflow:

conda activate nextflow
conda install -c conda-forge singularity

A system admin-configured Singularity installation should be preferred in comparison to an own local conda installation. Please ask your sys admin!

All other dependencies and tools will be installed within the pipeline via conda, Docker or Singularity depending on the profile you run (see below).

Quick start

Start a test run

# conda active nextflow
nextflow run hoelzer-lab/rnaflow -profile test,conda,local

... performs

a differential gene expression analysis with sub-sampled human read data,
on a reduced human genome and annotation (chromosome 1, 10 and 11),
comparing two conditions (MAQCA, MAQCB),
with a local execution (uses max. 4 cores in total and 8GB) and
conda dependency management.

Resource usage

For a local test run (with 30 cores in total at maximum):

nextflow run hoelzer-lab/rnaflow -profile test,conda,local -w work \
--max_cores 30 --cores 10 --softlink_results -r master

we observed the following resource usage including downloads and conda environment creation for each process:

Total runtime: 25m 56s
Physical memory (RAM), max.: 3.015 GB at process hisat2index
Virtual memory (RAM + Disk swap), max.: 10.16 GB at process hisat2

A detailed HTML report automatically produced the pipeline can be found here.

Call help

nextflow run hoelzer-lab/rnaflow --help

Update the pipeline

nextflow pull hoelzer-lab/rnaflow

Use a certain release

We recommend to use a stable release of the pipeline:

nextflow pull hoelzer-lab/rnaflow -r <RELEASE>

Usage

nextflow run hoelzer-lab/rnaflow --reads input.csv --autodownload hsa --pathway hsa --max_cores 6 --cores 2

with --autodownload <hsa|mmu|ssc|mau|eco> build-in species, or define your own genome reference and annotation files in CSV files:

nextflow run hoelzer-lab/rnaflow --reads input.csv --genome fastas.csv --annotation gtfs.csv --max_cores 6 --cores 2

Genomes and annotations from --autodownload, --genome and --annotation are concatenated.

By default, all possible comparisons are performed. Use --deg to change this.

--pathway <hsa|mmu|mau|ssc> performs downstream pathway analysis. Available are WebGestalt set enrichment analysis (GSEA) for hsa, mmu and ssc, piano GSEA with different settings and consensus scoring for hsa, mmu, mau, and ssc.

Input files

Read files (required)

Specify your read files in FASTQ format with --reads input.csv. The file input.csv has to look like this for single-end reads:

Sample,R1,R2,Condition,Source,Strandedness
mock_rep1,/path/to/reads/mock1.fastq.gz,,mock,,0
mock_rep2,/path/to/reads/mock2.fastq.gz,,mock,,0
mock_rep3,/path/to/reads/mock3.fastq.gz,,mock,,0
treated_rep1,/path/to/reads/treat1.fastq.gz,,treated,,0
treated_rep2,/path/to/reads/treat2.fastq.gz,,treated,,0
treated_rep3,/path/to/reads/treat3.fastq.gz,,treated,,0

and for paired-end reads, like this:

Sample,R1,R2,Condition,Source,Strandedness
mock_rep1,/path/to/reads/mock1_1.fastq,/path/to/reads/mock1_2.fastq,mock,A,0
mock_rep2,/path/to/reads/mock2_1.fastq,/path/to/reads/mock2_2.fastq,mock,B,0
mock_rep3,/path/to/reads/mock3_1.fastq,/path/to/reads/mock3_2.fastq,mock,C,0
treated_rep1,/path/to/reads/treat1_1.fastq,/path/to/reads/treat1_2.fastq,treated,A,0
treated_rep2,/path/to/reads/treat2_1.fastq,/path/to/reads/treat2_2.fastq,treated,B,0
treated_rep3,/path/to/reads/treat3_1.fastq,/path/to/reads/treat3_2.fastq,treated,C,0

The first line is a required header. Read files can be compressed (.gz). You need at least two replicates for each condition to run the pipeline. Source labels are optional - the header is still required, the value can be empty as in the single-end example above. Source labels can be used to define the corresponding experiment even more precisely for improved differential expression testing, e.g. if RNA-Seq samples come from different Conditions (e.g. tissues) but the same Sources (e.g. patients). Still, the comparison will be performed between the Conditions but the Source information is additionally used in designing the DESeq2 experiment. Source labels also extend the heatmap sample annotation. Strandedness for the samples can optionally be defined directly in the csv or via the commandline parameter --strand. Where the strandedness column can be any value from: 0 = unstranded, 1 = stranded, 2 = reversely stranded, [default: 0]. Note that if strandedness is provided via the input CSV and the commandline parameter, the value from the command line will be used for the run.

Genomes and annotation

If you don't use one of the build-in species, specify your genomes via --genome fastas.csv, with fastas.csv looking like this:

/path/to/reference_genome1.fasta
/path/to/reference_genome2.fasta

and --annotation gtfs.csv with gtfs.csv looking like this:

/path/to/reference_annotation_1.gtf
/path/to/reference_annotation_2.gtf

You can add a build-in species to your defined genomes and annotation with --autodownload xxx.

Build-in species

We provide a small set of build-in species for which the genome and annotation files are automatically downloaded from Ensembl with --autodownload xxx. Please let us know, we can easily add other species.

Species	three-letter shortcut	Annotation	Genome
Homo sapiens	`hsa` ^*	Homo_sapiens.GRCh38.98	Homo_sapiens.GRCh38.dna.primary_assembly
Mus musculus	`mmu` ^*	Mus_musculus.GRCm38.99	Mus_musculus.GRCm38.dna.primary_assembly
Sus scrofa	`ssc` ^*	Sus_scrofa.Sscrofa11.1.111	Sus_scrofa.Sscrofa11.1.dna.toplevel
Mesocricetus auratus	`mau` ^*	Mesocricetus_auratus.MesAur1.0.100	Mesocricetus_auratus.MesAur1.0.dna.toplevel
Escherichia coli	`eco`	Escherichia_coli_k_12.ASM80076v1.45	Escherichia_coli_k_12.ASM80076v1.dna.toplevel

^* Downstream pathway analysis availible via --pathway xxx.

Multiple-mapped reads

To adjust the handling of multiple-mapped reads during the feature counting process you can use: --featurecounts_additional_params '-t exon -g gene_id -M' The default handling is to only count uniquely mapped reads via featureCounts. With the above flag set featureCounts will also count multi-mapped reads.

Comparisons for DEG analysis

Per default, all possible pairwise comparisons in one direction are performed. Thus, when A is compared against B the pipeline will not automatically compare B vs. A which will anyway only change the direction of the finally resulting fold changes. To change this, please define the needed comparison with --deg comparisons.csv, where each line contains a pairwise comparison:

Condition1,Condition2
conditionX,conditionY
conditionA,conditionB
conditionB,conditionA

The first line is a required header.

Resume your run

You can easily resume your run in case of changes to the parameters or inputs. Nextflow will try to not recalculate steps that are already done:

nextflow run hoelzer-lab/rnaflow -profile test,conda,local -resume

Nextflow will need access to the working directory where temporary calculations are stored. Per default, this is set to work but can be adjusted via -w /path/to/any/workdir. In addition, the .nextflow.log file is needed to resume a run, thus, this will only work if you resume the run from the same folder where you started it.

Workflow control

Preprocessing

--skip_sortmerna                       # skip rRNA removal via SortMeRNA [default false]
--skip_read_preprocessing              # skip preprocessing with fastp [default: false]
--fastp_additional_params              # additional parameters for fastp [default '-5 -3 -W 4 -M 20 -l 15 -x -n 5 -z 6']
--hisat2_additional_params             # additional parameters for HISAT2
--featurecounts_additional_params      # additional parameters for FeatureCounts [default: -t gene -g gene_id]

DEG analysis

--strand                        # strandness for counting with featureCounts: 0 (unstranded), 1 (stranded) and 2 (reversely stranded) [default 0]
--tpm                           # threshold for TPM (transcripts per million) filter [default 1]
--deg                           # a CSV file following the pattern: conditionX,conditionY
--pathway                       # perform different downstream pathway analysis for the species hsa|mmu|mau|ssc
--feature_id_type               # ID type for downstream analysis [default: ensembl_gene_id]

Transcriptome assembly

--assembly                      # switch to transcriptome assembly
--busco_db                      # BUSCO database ['euarchontoglires' or path to existing DB]
--dammit_uniref90               # add UniRef90 to dammit databases, takes long [false]
--rna                           # activate directRNA mode for ONT transcriptome assembly [default: false (cDNA)]

Profiles/configuration options

Per default, the pipeline is locally executed with conda dependency management (corresponds to -profile local,conda). Adjust this setting by combining an executer option with an engine option, e.g. -profile local,conda or -profile slurm,conda. We also provide container support, see below.

Executor options...

... or how to schedule your workload.

Currently implemented are local, slurm and lsf executions.

You can customize local with this parameters:

--cores                         # cores for one process [default 1]
--max_cores                     # max. cores used in total [default allAvailable]
--memory                        # max. memory in GB for local use [default 8 GB]

Engine options...

... or in which environment to run the tools.

Currently implemented are conda, Docker and Singularity. For transcriptome assembly some tools need to be run with Docker or Singularity.

You can switch between different engines via -profile, for example:

nextflow run hoelzer-lab/rnaflow -profile test,local,conda
nextflow run hoelzer-lab/rnaflow -profile test,local,docker
nextflow run hoelzer-lab/rnaflow -profile test,slurm,singularity

As a best practice for a local execution, we recommend to run the pipeline with --cores 1 --max_cores 1 the first time you use Singularity, because we experienced issues when generating the Singularity images in parallel the first time the pipeline is executed with this engine option. It is also possible to run the pipeline once with --setup set. In setup mode all the necessary files (DBs, reference files and images) are being downloaded and set up.

You can customize where conda environments are stored using

--condaCacheDir /path/to/dir

and where Singularity images are stored via

--singularityCacheDir /path/to/dir

Docker images are stored based on your system configuration.

Monitoring

To monitor your computations the pipeline can be connected to Nextflow Tower. You need an user access token to connect your Tower account with the pipeline. Simply generate a login using your email and then click the link send to this address.

"Nextflow Tower does not require a password or registration procedure. Just provide your email address and we'll send you an authentication link to login. That's all!"

Once logged in, click on your avatar on the top right corner and select "Your tokens". Generate a token or copy the default one and set the environment variable:

export TOWER_ACCESS_TOKEN=<YOUR_COPIED_TOKEN>
export NXF_VER=20.10.0

You can save this command to your .bashrc or .profile to not need to enter it again.

Now run:

nextflow run hoelzer-lab/rnaflow -profile test,local,conda -with-tower

Alternatively, you can also activate the Tower connection within the nextflow.config file located in the root GitHub directory:

tower {
    accessToken = ''
    enabled = true
}

You can also directly enter your access token here instead of generating the above environment variable.

Output

The result folder is structured by each step and tool (results/step/tool) as follows:

results/
├── 01-Trimming
│   └── fastp                   trimmed reads
├── 02-rRNARemoval
│   └── SortMeRNA               rRNA-free (and trimmed) reads
├── 03-Mapping
│   └── HISAT2                  mapping results in BAM format with index files (BAI)
├── 04-Counting
│   └── featureCounts           counting table
├── 05-CountingFilter
│   └── TPM                     counting table with additional TPM value; formatted counting table filtered by TPM
├── 06-Annotation               filtered annotation; gene id, name and bio type mapping
├── 07-DifferentialExpression
│   └── DESeq2                  see below
├── 08-Assembly
│   └── de_novo
│      └── Trinity              Trinity assembly  (with --assembly)
├── 09-RNA-Seq_Annotation       BUSCO, dammit and StringTie2 results (with --assembly)
├── Logs                        Nextflow execution timeline and workflow report
└── Summary                     MultiQC report

Please note, that 08-Assembly and 09-RNA-Seq_Annotation are part of the transcriptome assembly branch (--assembly). Here, steps 04 to 07 are currently not applicable.

DESeq2 results

The DESeq2 result is structured as follows:

07-DifferentialExpression/
└── DESeq2
   ├── data                         
   │   ├── counts                   normalized, transformed counts; size factors table
   │   └── input                    DESeq2 input summary
   ├── deseq2.Rout                  R log file
   ├── MAQCA_vs_MAQCB               results for pairwise comparison (here exemplarily for the -profile test data set)
   │   ├── downstream_analysis  
   │   │   ├── piano                piano results
   │   │   └── WebGestalt           WebGestalt results
   │   ├── input                    DESeq2 input summary
   │   ├── plots
   │   │   ├── heatmaps
   │   │   ├── MA
   │   │   ├── PCA
   │   │   ├── sample2sample
   │   │   └── volcano
   │   ├── reports                  DESeq2 result HTML table; summary report
   │   └── results                  raw and filtered DESeq2 result in CSV and XLSX format; DEG analysis summary
   └── plots                        heatmaps and PCA of all samples

We provide DESeq2 normalized, regularized log (rlog), variance stabilized (vsd) and log2(n+1) (ntd) transformed count tables (DESeq2/data/counts).

For each comparison (specified with --deg or, per default, all possible pairwise comparisons in one direction), a new folder X_vs_Y is created. This also describes the direction of the comparison, e.g., the log2FoldChange describes the change of a gene A under condition Y with respect to the gene under condition X. For example, a log2FoldChange of +2 for gene A would tell you that this gene is 2-fold upregulated when we compare condition X vs. condition Y. The gene A is higher expressed in samples belonging to condition X.

Downstream analysis (--pathway xxx) are currently provided for some species: GSEA consensus scoring with piano for Homo sapiens (hsa), Mus musculus (mmu), Mesocricetus auratus (mau), and Sus scofa (ssc); and WebGestalt GSEA for Homo sapiens, Mus musculus, and Sus scrofa.

Working offline

In case you don't have an internet connection, here is a workaround to this issue for manual download and copying of external recourses:

Genomes and annotation can also be specified via --genome and --annotation, see here.
For BUSCO it is a simple download, see here with busco_db = 'euarchontoglires_odb9' as default.
For SortMeRNA and dammit the tools must be installed. Version specifications can be found here and there, the code to create the databases here and there with busco_db = 'euarchontoglires_odb9' dammit_uniref90 = false as default.
Downstream analysis with piano and WebGestalt currently need an internet connection in any case. If no connection is available piano and WebGestalt are skipped.

RNAflow looks up the files here:

nextflow-autodownload-databases     # default: `permanentCacheDir = 'nextflow-autodownload-databases'`
└── databases
    └── busco
        └── <busco_db>.tar.gz
    └── dammit
        └── <busco_db>.tar.gz
        └── uniref90                # in case of `dammit_uniref90 = true`
            └── <busco_db>.tar.gz
    └── sortmerna
        └── data
            └── rRNA_databases

Help message

click here to see the complete help message

Usage examples:
nextflow run hoelzer-lab/rnaflow -profile test,local,conda
nextflow run hoelzer-lab/rnaflow --cores 4 --reads input.csv --autodownload mmu --pathway mmu
nextflow run hoelzer-lab/rnaflow --cores 4 --reads input.csv --autodownload eco --assembly
nextflow run hoelzer-lab/rnaflow --cores 4 --reads input.csv --genome fasta_virus.csv --annotation gtf_virus.csv --autodownload hsa --pathway hsa
Genomes and annotations from --autodownload, --genome and --annotation are concatenated.

Input:
--reads                  A CSV file following the pattern: Sample,R1,R2,Condition,Source,Strandedness - read mode is detected automatically
                                    (check terminal output if correctly assigned)
                                    Per default, all possible comparisons of conditions in one direction are made. Use --deg to change.
--autodownload           Specifies the species identifier for automated download [default: ]
                                    Currently supported are:
                                    - hsa [Ensembl: Homo_sapiens.GRCh38.dna.primary_assembly | Homo_sapiens.GRCh38.98]
                                    - eco [Ensembl: Escherichia_coli_k_12.ASM80076v1.dna.toplevel | Escherichia_coli_k_12.ASM80076v1.45]
                                    - mmu [Ensembl: Mus_musculus.GRCm38.dna.primary_assembly | Mus_musculus.GRCm38.99.gtf]
                                    - ssc [Ensembl: Sus_scrofa.Sscrofa11.1.dna.toplevel | Sus_scrofa.Sscrofa11.1.111 ]
                                    - mau [Ensembl: Mesocricetus_auratus.MesAur1.0.dna.toplevel | Mesocricetus_auratus.MesAur1.0.100]
--species                Specifies the species identifier for downstream path analysis. (DEPRECATED)
                         If `--include_species` is set, reference genome and annotation are added and automatically downloaded. [default: ]
                                    Currently supported are:
                                    - hsa [Ensembl: Homo_sapiens.GRCh38.dna.primary_assembly | Homo_sapiens.GRCh38.98]
                                    - eco [Ensembl: Escherichia_coli_k_12.ASM80076v1.dna.toplevel | Escherichia_coli_k_12.ASM80076v1.45]
                                    - mmu [Ensembl: Mus_musculus.GRCm38.dna.primary_assembly | Mus_musculus.GRCm38.99.gtf]
                                    - mau [Ensembl: Mesocricetus_auratus.MesAur1.0.dna.toplevel | Mesocricetus_auratus.MesAur1.0.100]
--genome                 CSV file with genome reference FASTA files (one path in each line)
                                    If set, --annotation must also be set.
--annotation             CSV file with genome annotation GTF files (one path in each line)
--include_species        Either --species or --genome/--annotation need to be used. Both input seetings can be also combined to use genome and annotation of 
                         supported species in addition to --genome and --annotation [default: false]

Preprocessing options:
--fastp_additional_params          additional parameters for fastp [default: -5 -3 -W 4 -M 20 -l 15 -x -n 5 -z 6]
--skip_sortmerna                   skip rRNA removal via SortMeRNA [default: false] 
--skip_read_preprocessing          skip preprocessing with fastp [default: false]
--hisat2_additional_params         additional parameters for HISAT2 [default: ]
--featurecounts_additional_params  additional parameters for FeatureCounts [default: -t gene -g gene_id]

DEG analysis options:
--strand                 0 (unstranded), 1 (stranded) and 2 (reversely stranded) [default: 0]
                         This will overwrite the optional strandedness defined in the input CSV file.
--tpm                    Threshold for TPM (transcripts per million) filter. A feature is discared, if for all conditions the mean TPM value of all 
                         corresponding samples in this condition is below the threshold. [default: 1]
--deg                    A CSV file following the pattern: conditionX,conditionY
                         Each line stands for one differential gene expression comparison.  
                         Must match the 'Condition' labels defined in the CSV file provided via --reads.  
--pathway                Perform different downstream pathway analysis for the species. [default: ]
                         Currently supported are:
                             - hsa | Homo sapiens
                             - mmu | Mus musculus
                             - mau | Mesocricetus auratus
                             - ssc | Sus scrofa
--feature_id_type        ID type for downstream analysis [default: ensembl_gene_id]                            

Transcriptome assembly options:
--assembly               Perform de novo and reference-based transcriptome assembly instead of DEG analysis [default: false]
--busco_db               The database used with BUSCO [default: euarchontoglires_odb9]
                         Full list of available data sets at https://busco-data.ezlab.org/v5/data/lineages/ 
--dammit_uniref90        Add UniRef90 to the dammit databases (time consuming!) [default: false]
--rna                    Activate directRNA mode for ONT transcriptome assembly [default: false (cDNA)]

Computing options:
--cores                  Max cores per process for local use [default: 1]
--max_cores              Max cores used on the machine for local use [default: 4]
--memory                 Max memory in GB for local use [default: 8 GB]
--output                 Name of the result folder [default: results]

Caching:
--permanentCacheDir      Location for auto-download data like databases [default: nextflow-autodownload-databases]
--condaCacheDir          Location for storing the conda environments [default: conda]
--singularityCacheDir    Location for storing the singularity images [default: singularity]
--workdir                Working directory for all intermediate results [default: null] (DEPRECATED: use `-w your/workdir` instead)
--softlink_results       Softlink result files instead of copying.
--setup                  Download all necessary DB, reference and image files without running the pipeline. [default: false]

Nextflow options:
-with-tower              Activate monitoring via Nextflow Tower (needs TOWER_ACCESS_TOKEN set).
-with-report rep.html    CPU / RAM usage (may cause errors).
-with-dag chart.html     Generates a flowchart for the process tree.
-with-timeline time.html Timeline (may cause errors).

Execution/Engine profiles:
The pipeline supports profiles to run via different Executers and Engines e.g.: -profile local,conda

Executer (choose one):
  local
  slurm
  lsf
  latency

Engines (choose one):
  conda
  mamba
  docker
  singularity

Per default: -profile local,conda is executed. 

For a test run (~ 15 min), add "test" to the profile, e.g. -profile test,local,conda.
The command will create all conda environments and download and run test data.

We also provide some pre-configured profiles for certain HPC environments:    
  ara (slurm, conda and parameter customization)

Known bugs and issues

Problems with `SortMeRNA`/ `HISAT2` error (#141, #116)

Description

The pipeline fails with something like

this

Error executing process > 'preprocess:hisat2 (2)'

Caused by:
  Missing output file(s) `22_rep4_summary.log` expected by process `preprocess:hisat2 (2)`

Command executed:

  hisat2 -x reference -1 22_rep4.R1.other.fastq.gz -2 22_rep4.R2.other.fastq.gz -p 60 --new-summary --summary-file 22_rep4_summary.log  | samtools view -bS | samtools sort -o 22_rep4.sorted.bam -T tmp --threads 60

Command exit status:
  0

Command output:
  (empty)

Command error:
  Error: Read AFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ has more quality values than read characters.
  terminate called after throwing an instance of 'int'
  Aborted (core dumped)
  (ERR): hisat2-align exited with value 134
  [bam_sort_core] merging from 0 files and 60 in-memory blocks...
  grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
  
Work dir:
  /tmp/nextflow-work-as11798/2f/4a5b7060530705c2697bdf3eec73a4

Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line

Often encountered when running in screen or tmux
Nextflow's -bg option does not help

Workaround

Skip SortMeRNA with --skip_sortmerna
Reads can be cleand beforhand e.g. with CLEAN

Latency problems on HPCs, issue (#79)

Description

Latency related problems with Nextflow might occur when running on HPC systems, where Nextflow expects files to be available before they are fully written to the file system. In these cases Nextflow might get stuck or report missing output or input files to some processes:

ERROR ~ Error executing process > 'some_process'

Caused by:
 Missing output file(s) `some_process.out` expected by process `some_process`

Often encountered when running on HPC systems

Workaround

Please try running the pipeline with the latency profile activated, just add it to the profiles you already defined:

-profile slurm,conda,latency

Citation

If you use RNAflow please cite:

RNAflow

Marie Lataretu and Martin Hölzer. "RNAflow: An effective and simple RNA-Seq differential gene expression pipeline using Nextflow". Genes 2020, 11(12), 1487; https://doi.org/10.3390/genes11121487

rnaflow's People

Contributors

Stargazers

Watchers

Forkers

flomock klamkiew vikash84 najlabioinfo ibrmo cgpu dakehero michael-rhoese sinonkt gabriellovate joanz0019 zm-git-dev danchubb aswitwicka master-xs telatin bingli2019 biomage-org caitl-vale

rnaflow's Issues

compare automatically all conditions

conditions in read-input-csv
default option

Differences between foldchange in CSV tables and html output

For example:

/beegfs/rna-hta/projects/mh_ml_baff/results_deseq2

└─> head -2 memory_control_vs_memory_baff/deseq2_memory_control_memory_baff_filtered_p05.csv 
"","baseMean","log2FoldChange","lfcSE","pvalue","padj"
"ENSG00000102794",10.1653348954568,2.8801372660103,0.70134251946623,3.16660059655261e-06,0.0116353572319729

For this gene, the log2FC is 2.88 but the ReportingTools HTML table tells 3.170:

TPM filter

we should calculate TPM values and filter for them automatically. We could calculate them based on the featurecounts output and only pass genes to DESeq that have a TPM >= 1.

Suggestion:

calculate TPM values for each gene of each individual sample
calculate mean TPM values for each gene over all replicates
if at least one mean TPM value is >= 1 (could be a parameter of the workflow) keep the gene for all samples

Maybe we can use:

deseq2.R ends with exit staus 1

deseq2.R ends with exit staus 1:

Error in file(file, ifelse(append, "a", "w")) : 
  cannot open the connection
Calls: write.csv -> eval.parent -> eval -> eval -> write.table -> file
In addition: Warning message:
In file(file, ifelse(append, "a", "w")) :
  cannot open file 'results/05-DifferentialExpression//deseq2/input.csv': No such file or directory
Execution halted

Input file order for DESeq2

@MarieLataretu please make sure that all the input files are in correct order. I just did a test run and

$ head normalized_counts.csv
"","mock_rep2.counts.formated","mock_rep1.counts.formated","mock_rep3.counts.formated","treat_rep1.counts.formated","treat_rep2.counts.formated","treat_rep3.counts.formated"
"ER3413_4519",51,16.4947669943033,25.8379052219235,41.1815207717244,12.7443759315705,12.9994616423613

shows me first rep2 and then rep1 of mock. This is no problem at all if also the correct read counts are associated with each label all the time. If you think everything is fine, close this.

Make the DEseq2 script configurable

Make the DEseq2 script configurable:

declare conditions
comparisons to perform
patients vector
...

publish mode

use default (absolute sym link) for local execution and copy otherwise

sortmerna in paired-end mode collects non-RNA reads in one fastq file

sortmerna in paired-end mode writes all non-RNA reads into one fastq file.

Do we want to use hisat2 with the -U parameter (like single-end mode), or should we split the file in _1, _2 and singletons?

fastp default parameters

Just copy/pasted from a discussion with @flomock:

Es erscheint mir also sinnvoll das wir -5 -3 auch mit angeben. Eigentlich sollte -3 reichen da die qualitaet ja eigentlich zum Ende hin abflacht. Aber dein Fall illustriert ja gut, das man eben auch 5' nen N oder sowas haben kann (mit low quality) und dann ergibt das -5 auch sinn.

In dem Zusammenhang @MarieLataretu koennte man auch ueberlegen per default polyX am 3' zu trimmen:

  -x, --trim_poly_x                    enable polyX trimming in 3' ends.
      --poly_x_min_len                 the minimum length to detect polyX in the read tail. 10 by default. (int [=10])

Es scheint naemlich so als macht fastp per default nur polyG trimming.

Also mein Vorschlag, da wir eh schon additionalParams implementiert haben nehmen wir die ganzen params aus dem fastp modul raus

fastp -i ${reads[0]} -o ${name}.trimmed.fastq.gz -n 5 --thread ${task.cpus} --json ${name}_fastp.json -z 6 ${additionalParams}

fastp -i ${reads[0]} -o ${name}.trimmed.fastq.gz --thread ${task.cpus} --json ${name}_fastp.json ${additionalParams}

und fuegen wir in die nextflow.config ein:

fastp_additional_params = '-5 -3 -W 4 -M 20 -l 15 -x -n 5 -z 6'

Damit ist klar, wie wir per default trinmmen und man kann es auch ueberschreiben.

DESeq2 script patients input

I'm not sure if this is expected behaviour, or if I don't get the patients input in the DESeq2 script correctly.

c("1","1","1","2","2","2") fails here:

> if (length(patients) > 0) {
+     sampleTable <- data.frame(sampleName = samples, fileName = samples, condition = conditions, type = col.labels, patients = patients)
+     # ddsHTSeq <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable, directory = "", design= ~ patients + condition) # doesn"t work with nextflow
+     ddsHTSeq <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable, design= ~ patients + condition)
+ } else {
+     sampleTable <- data.frame(sampleName = samples, fileName = samples, condition = conditions, type = col.labels)
+     # ddsHTSeq <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable, directory = "", design= ~ condition) # doesn"t work with nextflow
+     ddsHTSeq <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable, design= ~ condition)
+ }
Error in checkFullRank(modelMatrix) :
  the model matrix is not full rank, so the model cannot be fit as specified.
  One or more variables or interaction terms in the design formula are linear
  combinations of the others and must be removed.

  Please read the vignette section 'Model matrix not full rank':

  vignette('DESeq2')
Calls: DESeqDataSetFromHTSeqCount ... DESeqDataSetFromMatrix -> DESeqDataSet -> checkFullRank
Execution halted

The complete command:

R CMD BATCH --no-save --no-restore '--args c(".") c("mock_rep1.counts.filtered.formated","mock_rep2.counts.filtered.formated","mock_rep3.counts.filtered.formated","treat_rep1.counts.filtered.formated","treat_rep2.counts.filtered.formated","treat_rep3.counts.filtered.formated") c("mock","mock","mock","treat","treat","treat") c("mock_rep1","mock_rep2","mock_rep3","treat_rep1","treat_rep2","treat_rep3") c("mock","treat") c("mock:treat","treat:mock") c("eco.id2ensembl") c("eco.gene.gtf") c("eco") c("1","1","1","2","2","2")' deseq2.R

Same input and command with c("1","2","3","1","2","3") as patient does not fail

Update SortMeRNA

We should really update the SortMeRNA version, so that we can make use of multi-threading. At least I hope, it'll run faster then.

We have to take care how it works in paired-end mode.

And we need to figure out how to install the tool.

improve `--cores` option

At the moment --cores sets the number of cores for each process, which means with --cores 20 and e.g. four hisat2 runs in parallel, one would use 80 cores.

I would like to have a parameter --max-cores for the maximum number of cores, that are used in total, and an other parameter or config variable or something to set the cores for each process.

I'm not sure what's the best way to do it in Nextflow.

There is a cpus directive for processes: https://www.nextflow.io/docs/latest/process.html?#cpus
One can set default values for the process directives via the config process scope: https://www.nextflow.io/docs/latest/config.html?#scope-process
One can set cpus via the config profiles: https://www.nextflow.io/docs/latest/config.html?#config-profiles
From the nexflow run options, we have:

-process.
       Set process options
       Syntax: -process.key=value
       Default: {}
-qs, -queue-size
       Max number of processes that can be executed in parallel by each executor

What's your opinion on this?

Pathway analysis

The people always want subsequent pathway analysis of the DEGs.

I used the

Piano R package

for this sometimes. What I really like is

WebGestalt

that we can maybe also use locally?

The identified GO terms could then be visualized with

Revigo

It might be also worth to check current literature for new tools performing GSEA or pathway detection based on DEGs.

generalize refactor_reportingtools_table.rb

I included this script that extends the ReportingTools output table by start/stop positions and gene names. However, the script has some code parts that depend on the input species.

gene_id = row_splitted[0].scan(/ER[0-9]+_[0-9]+/)[0]

new_row = [row_splitted[0].sub('<td class="">',"<td class=\"\"><a target=\"_blank\" href=\"https://bacteria.ensembl.org/Escherichia_coli_k_12/Gene/Summary?g=#{gene_id};\">") + '</a>', "<td class=\"\">#{gene_name}", "<td class=\"\">#{gene_biotype}", pos_part, row_splitted[1].sub('href=','target="_blank" href=').sub('<td class="">','<td class=""><div style="width: 200px">') + '</div>', row_splitted[2], row_splitted[3], row_splitted[4]].join('</td>') << '</td>'

In this example the script works for the input

--species eco

I think we have two options:

A

Remove the URL link to Ensembl that changes with species name and try to generalize the gene_id = part.

B

Depending on the input --species parameter define these values.

Change whole workflow syntax to DSL2

@MarieLataretu I changed the syntax of one process (trimming) to the DSL syntax using modules. I hope this can serve as an example. Otherwise, also see this repository for example:

https://github.com/hoelzer-lab/treat

I think with DSL activated now the workflow does not execute the other processes. The idea is to rewrite the processes that we already have as modules and then continue to extend the workflow.

I can help with translating to the new DSL2 syntax, but I just saw that you have a rb2py branch, I hope you can simply merge the branch with the master because I did not change anything in this code area before we continue to migrate to DSL2.

scripts dir in config

definde the scripts directory in the config

Add sleuth as DEseq2 alternative

sleuth

Ideal: rnaseq Interactive Tool (r package)

Sounds interesting for later reference

https://t.co/ZDjV2hgMiB?amp=1

DESeq2 terminated with an error exit status

for some input DESeq2 fails:

> publish(dds, des2Report.full, pvalueCutoff=1, annotation.db=db, factor = colData(dds)$condition, reportDir=out, n = length(row.names(dds)))
Error in `$<-.data.frame`(`*tmp*`, "Image", value = "<a href=\"figuresRNAseq_analysis_with_DESeq2_full/boxplot..pdf\"><img border=\"0\" src=\"figuresRNAseq_analysis_with_DESeq2_full/mini..png\" alt=\"figuresRNAseq_analysis_with_DESeq2_full/mini..png\"></img></a>") : 
  replacement has 1 row, data has 0
Calls: publish ... modifyReportDF -> .local -> eSetPlot -> $<- -> $<-.data.frame
Execution halted

update profiles

Hi, can you please try

nextflow run hoelzer-lab/rnaseq --reads ~/.nextflow/assets/hoelzer-lab/rnaseq/input.se.eco.csv --species eco --dge ~/.nextflow/assets/hoelzer-lab/rnaseq/input.dge_comparison.csv -profile slurm

And in addition, I am not sure if nextflow will interpreter your '~/' correctly, I think it's better to write the full path with /home/$USER.

Background information

On a cluster system you don't need to specify --cores or such because the cores and memory that will be used is predefined in the configuration file for the specific cluster environment (in this case SLURM), see: https://github.com/hoelzer-lab/rnaseq/blob/master/configs/slurm_conda.config

Then, in the main nextflow.config file, there is the code snippet:

    slurm {
        params.cloudProcess = true
      	params.workdir = "/beegfs/rna-hta/wi74fij/work"
  	params.cloudDatabase = "/beegfs/rna-hta/wi74fij/nextflow-databases/"
        includeConfig 'configs/slurm_conda.config' }

So this setup will be used when you run nextflow with -profile slurm. Per default (and because we did not use this until now) paths with my username are used here.

@MarieLataretu we can simply replace wi74fij with $USER here I think
@MarieLataretu and it makes more sense if we name this profile ara when we set ARA-specific paths. We can also have an additional and empty slurm profile
@MarieLataretu we should also add parameters for --workdir, --cachedir, --databases like done for CLEAN: https://github.com/hoelzer/clean/blob/master/clean.nf#L319

And I think in this context we should also update the workflow to use merged profiles. It's much cleaner and super powerful. I will explain below how and why. And it's totally enough to have this implemented for the conda environments for now (although I will also give examples for Docker below, just to illustrate this). For an example of profile merging please check out the configs here:
https://github.com/hoelzer/poseidon/tree/master/configs

Why merging profiles? Because we want to maintain conda or docker tool versions only in one place but for example use the same conda environments on a local machine and with SLURM (LSF, ...).

What I can do with the above configs in PoSeiDon is

nextflow run hoelzer/poseidon --fasta some.fasta -profile local,docker

and now the config file snippets labeled local and docker in the nextflow.config will be merged according to the labels.

For example:

local.config

process.executor = 'local'

process {
        withLabel: raxml        { cpus = params.cores }
        withLabel: codeml     { cpus = 1 }
}

docker.config

docker { enabled = true }
            
process {   
        withLabel: raxml        { container = 'nanozoo/raxml:8.2.12--27d10cf' }
        withLabel: codeml       { container = 'nanozoo/paml:4.9--cff12e2' }
}

Now, when I run the pipeline with -profile local,docker these configurations will be merged into:

process.executor = 'local'
docker { enabled = true }
            
process {   
        withLabel: raxml        { cpus = params.cores, container = 'nanozoo/raxml:8.2.12--27d10cf' }
        withLabel: codeml       { cpus = 1, container = 'nanozoo/paml:4.9--cff12e2' }
}

With such a scheme, I can now easily add a SLURM profile like

slurm.config

process.executor = 'slurm'
workDir = params.workdir

process {
        withLabel: raxml        { cpus = 32 ; memory = '24 GB' }
        withLabel: codeml     { cpus = 4 ; memory = '2 GB' }
}

and then run the pipeline with -profile slurm,docker

or -profile slurm,conda

or -profile local,conda

and still only need to maintain the tool versions in the docker.config and the environment files used in the conda.config

@MarieLataretu can you try to implement this so we can run the following profiles:

local,conda
slurm,conda
ara,conda (which is essentially the same as above but with some predefined paths)

I think it is worth doing this now because then we can also easily add Docker- and even Cloud-support in the future.

Originally posted by @hoelzer in #36 (comment)

Copy Number Variant detection

https://www.nature.com/articles/s41467-019-13779-x

Remove tRNAs and mitochondrial RNAs from annotation

tRNAs and mitochondrial RNAs can make up a large proportion of mapped reads and thus bias the normalization and differential gene expression procedure.

For example: "tRNA genes and genes covered by fewer than 10 reads in
all tissues were removed."
(https://www.biorxiv.org/content/10.1101/2020.06.14.150599v1.full.pdf)

I am thinking about the following: optional filter parameters to

remove certain contigs from the featurecounts table (--rm_contigs chrM)
remove certain gene types from the featurecounts table (--rm_biotype tRNA)

I would do this optional because in many cases people also want to look at mtRNAs etc...

@MarieLataretu what do you think? Lower priority, can be also a later feature.

add mmu

Update and extend README

add requirements
add more examples with different use cases
...

ReportingTools file output optimization

This is a follow-up for #38 #56

Currently, we produce all PDF/PNG box plots for all genes (that are the input of the R script and survived the TPM filter) here:

https://github.com/hoelzer-lab/rnaseq/blob/master/bin/deseq2.R#L564

And later on, for the pairwise comparisons that have significant genes (adjusted p 0.05) we do this again:

https://github.com/hoelzer-lab/rnaseq/blob/master/bin/deseq2.R#L706

It would be more convenient if we do this once for all genes and then simply point to the correct plots in the HTML tables of the pairwise comparisons. I think the easiest way is

soft-link the figuresRNAseq_analysis_with_DESeq2_full folder to the html folders of the pairwise comparisons

Out of my head, we can probably directly do this in the R script with something like

system(paste("ln -s $PWD/html/figuresRNAseq_analysis_with_DESeq2_full $PDW/",out.sub,"/html/figuresRNAseq_analysis_with_DESeq2",sep=''))

File input check

small suggestion to improve input handling via CSV's
your current code (see also here: line-link):

.map { row -> ["${row[0]}", [file("${row[1]}"), file("${row[2]}")]] }

suggestion:

.map { row -> ["${row[0]}", [file("${row[1]}", checkIfExists: true), file("${row[2]}", checkIfExists: true)]] }

adding this will check directly while mapping if a file exists and gives you a clear error report if, and what file may be missing

€ can't add labels and stuff

user defined database directory

for

sortmeRNA
auto-species-download

fix deseq2 output

for the conditions comparison
statistics and heatmaps is empty

error seems to occur already in summary.txt

#deseq2.res$padj < 0.1:
FALSE TRUE
37815

#deseq2.res$padj < 0.05:
FALSE TRUE
37815

But in fact if I open the RNAseq_analysis_with_DESeq2_full.html and filter for pvalues <0.05 i find over 1000 genes.

missing implementation of comparison specific output

In deseq2.R at line 636 the data is filtered to only contain sicnificant genes and then nothing happens.
Line 636 seems to be very usecase specific refering to a ruby script not includetin rnaseq.

Could you please provide either a general version of line 636 and below or implement a simpler version just saving a html, heatmap and csv.

thanks in advance

sensesful file endings

in results/featurecounts
current ending .counts
better .tsv or .csv

Should we accept input with no replicates?

Should we accept input with no replicates? Or should we let the pipeline run until DESeq2, which complains about it?

> dds <- DESeq(ddsHTSeq, betaPrior = TRUE)
estimating size factors
estimating dispersions
Error in checkForExperimentalReplicates(object, modelMatrix) : 

  The design matrix has the same number of samples and coefficients to fit,
  so estimation of dispersion is not possible. Treating samples
  as replicates was deprecated in v1.20 and no longer supported since v1.22.

Calls: DESeq ... estimateDispersions -> .local -> checkForExperimentalReplicates
Execution halted

Add k-mer based/alignment-free mapper

e.g. Salmon
Kallisto
Sailfish
...?

dream: Powerful differential expression analysis for repeated measures designs

https://www.biorxiv.org/content/10.1101/432567v2

Sounds very interesting!

Final visualization with spark

https://www.biorxiv.org/content/10.1101/845529v1

Heatmaps for pairwise comparisons

It looks like heatmaps for pairwise comparisons show columns for all samples in the data set (here 18 samples) but only labels for the samples that are in the pairwise comparison. I think simply the wrong object is used for plotting (the full is used instead of a substitute)?

Unknown configuration profile: 'conda'

I downloaded rnaseq on Ara using
nextflow pull hoelzer-lab/rnaseq
tested the help
nextflow run hoelzer-lab/rnaseq --help
tried the example (adapted the path as shown in hoelzer/clean)
nextflow run hoelzer-lab/rnaseq --max_cores 6 --cores 2 --reads ~/.nextflow/assets/hoelzer-lab/rnaseq/input.se.eco.csv --species eco --dge ~/.nextflow/assets/hoelzer-lab/rnaseq/input.dge_comparison.csv -profile conda

the output i got:

N E X T F L O W ~ version 20.04.1
Launching hoelzer-lab/rnaseq [intergalactic_euclid] - revision: 99e2759 [master]
Unknown configuration profile: 'conda'

btw.: In the readme is a small mistake, there is a space between the -- and max_cores

optional extended sample infos...

... to make PCAs great again

Fastqc and multiqc

Fastqc for reads and multiqc can be used to make a report about

reads
mapping w/ hisat2
featurecounts
*?

ssconvert alternative

In the deseq2.R script ssconvert is used to convert a CSV to XLSX. I had a rough look but there seems to be no conda env for this linux command. Maybe we find an alternative? XLSX always nice for cooperation partners :)

system(paste("ssconvert ", tmp_out, "/tmp.csv ", out.sub, "/", name, "_full.xlsx", sep=""))

For now, I commented the lines.

Pathway analysis with the Piano R package

We already have this implemented in the deseq2.R script:

piano(out.sub, resBaseMean, resFold, ensembl)

but to function properly we need a proper BioMart object:

ensembl = useMart(biomart = "ENSEMBL_MART_ENSEMBL",dataset="mmusculus_gene_ensembl", host = "apr2020.archive.ensembl.org")

and to have this we somehow need to know in the R script which species (host) is used. And there seem also differences between the general ensemble and bacteria ensemble.

Luckily, piano is available from bioconda so I will already add the package to the deseq env

set cachedir, workdir, ...

... to make it easy to run on a cluster

Alternative trimming/adapter tool

looks interesting and likely faster than fastp:

https://github.com/hellosunking/Ktrim/

low priority though but worth a module and try

errorStrategy { task.exitStatus in 137..140 ? 'retry' : 'terminate' }
cpus { 12 * task.attempt }
memory { 70.GB * task.attempt }
maxRetries 2

memory { task.memory * task.attempt }

should also work.

DESeq2 best practice

We should check that DESeq2 is run in a best possible default way.

For example:

In version 1.16 (November 2016), the log2 fold change shrinkage is no longer default for the DESeq and nbinomWaldTest functions, by setting the defaults of these to betaPrior=FALSE, and by introducing a separate function lfcShrink, which performs log2 fold change shrinkage for visualization and ranking of genes. While for the majority of bulk RNA-seq experiments, the LFC shrinkage did not affect statistical testing, DESeq2 has become used as an inference engine by a wider community, and certain sequencing datasets show better performance with the testing separated from the use of the LFC prior. Also, the separation of LFC shrinkage to a separate function lfcShrink allows for easier methods development of alternative effect size estimators.

http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#standard-workflow

We might also want to use the lfcShrink function.

Morpheus heatmap improvement

https://software.broadinstitute.org/morpheus/

sortmerna version update

There is now a new version on conda

conda create -n sortmerna -c bioconda sortmerna=4.2.0

so we might want to update the YAML file and adjust the command if necessary.

sorting x-axes boxplot

see example from html
my suggestion:
either sort alphabetically
or sort dependent from dge file

Recommend Projects

React

A declarative, efficient, and flexible JavaScript library for building user interfaces.
Vue.js

🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
Typescript

TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
TensorFlow

An Open Source Machine Learning Framework for Everyone
Django

The Web framework for perfectionists with deadlines.
Laravel

A PHP framework for web artisans
D3

Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

javascript

JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
web

Some thing interesting about web. New door for the world.
server

A server is a program made to process requests and deliver data to clients.
Machine learning

Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Visualization

Some thing interesting about visualization, use data art
Game

Some thing interesting about game, make everyone happy.

Recommend Org

Facebook

We are working to build community through open source technology. NB: members must have two-factor auth.
Microsoft

Open source projects and samples from Microsoft.
Google

Google ❤️ Open Source for everyone.
Alibaba

Alibaba Open Source for everyone
D3

Data-Driven Documents codes.
Tencent

China tencent open source team.

hoelzer-lab / rnaflow Goto Github PK

rnaflow's Introduction

RNAflow - An effective and simple RNA-Seq differential gene expression pipeline using Nextflow

Quick installation

Quick start

Start a test run

Call help

Update the pipeline

Use a certain release

Usage

Input files

Read files (required)

Genomes and annotation

Build-in species

Multiple-mapped reads

Comparisons for DEG analysis

Resume your run

Workflow control

Preprocessing

DEG analysis

Transcriptome assembly

Profiles/configuration options

Executor options...

Engine options...

Monitoring

Output

DESeq2 results

Working offline

Help message

Known bugs and issues

Problems with SortMeRNA/ HISAT2 error (#141, #116)

Description

Workaround

Latency problems on HPCs, issue (#79)

Description

Workaround

Citation

rnaflow's People

Contributors

Stargazers

Watchers

Forkers

rnaflow's Issues

Make the DEseq2 script configurable:

A

B

Background information

Recommend Projects

Recommend Topics

Recommend Org

Problems with `SortMeRNA`/ `HISAT2` error (#141, #116)