Giter Site home page Giter Site logo

epigen / dea_limma Goto Github PK

View Code? Open in Web Editor NEW
13.0 5.0 0.0 78 KB

A Snakemake workflow for performing and visualizing differential expression analyses (DEA) on NGS data powered by the R package limma.

Home Page: https://epigen.github.io/dea_limma/

License: MIT License

Python 28.36% R 71.64%
atac-seq bioinformatics biomedical-data-science chip-seq differential-expression-analysis limma limma-trend limma-voom rna-seq snakemake visualization volcano-plot workflow scrna-seq statistics r

dea_limma's Introduction

Differential Analysis & Visualization Snakemake Workflow Using limma

DOI

A Snakemake workflow for performing and visualizing differential expression (or accessibility) analyses (DEA) of NGS data (eg RNA-seq, ATAC-seq, scRNA-seq,...) powered by the R package limma.

This workflow adheres to the module specifications of MR.PARETO, an effort to augment research by modularizing (biomedical) data science. For more details and modules check out the project's repository.

If you use this workflow in a publication, please don't forget to give credits to the authors by citing it using this DOI 10.5281/zenodo.7808516.

Workflow Rulegraph

Table of contents

Authors

Software

This project wouldn't be possible without the following software and their dependencies:

Software Reference (DOI)
edgeR https://doi.org/10.1093/bioinformatics/btp616
EnhancedVolcano https://doi.org/10.18129/B9.bioc.EnhancedVolcano
ggplot2 https://ggplot2.tidyverse.org/
patchwork https://CRAN.R-project.org/package=patchwork
pheatmap https://cran.r-project.org/package=pheatmap
limma https://doi.org/10.1093/nar/gkv007
Snakemake https://doi.org/10.12688/f1000research.29032.2

Methods

This is a template for the Methods section of a scientific publication and is intended to serve as a starting point. Only retain paragraphs relevant to your analysis. References [ref] to the respective publications are curated in the software table above. Versions (ver) have to be read out from the respective conda environment specifications (.yaml file) or post execution. Parameters that have to be adapted depending on the data or workflow configurations are denoted in squared brackets e.g. [X].

Differential Expression Analysis (DEA). DEA was performed on the quality-controlled filtered [raw/normalized] counts using the limma (ver) [ref] workflow for fitting a linear model [formula] to identify features (genes/regions) that statistically significantly change with [comparisons] compared to the control group [reference levels] (intercept). Briefly, we determined normalization factors with edgeR::calcNormFactors (optional) using method [X], then applied voom (optional) to estimate the mean-variance relationship of the log-counts. We used blocking on (optional) variable [X] to account for repeated measurements, lmFit to fit the model to the data, and finally eBayes (optional) with the robust (and trend flag – optional for normalized data) flag to compute (moderated/ordinary) t-statistics. For each comparison we used topTable to extract feature-wise average expression, effect sizes (log2 fold change) and their statistical significance as adjusted p-values, determined using the Benjamini-Hochberg method. Furthermore, we calculated feature scores, for each feature in all comparisons, using the formula [score_formula] for downstream ranked enrichment analyses. Next, these results were filtered for relevant features based on the following criteria: statistical significance (adjusted p-value < [X]), effect size (absolute log2 fold change > [X]), and expression (average expression > [X]). Finally, we performed hierarchical clustering on the effect sizes (log2 fold changes) of the union of all relevant features and comparison groups.

Visualization. The filtered result statistics, i.e., number of relevant features split by positive (up) and negative (down) effect sizes, were visualized with stacked bar plots using ggplot (ver) [ref]. To visually summarize results of all performed comparisons, the effect size (log2 fold change) values of all relevant features in at least one comparison were plotted in a hierarchically clustered heatmap using pheatmap (ver) [ref]. Volcano plots were generated for each comparison using EnhancedVolcano (ver) [ref] with adjusted p-value threshold of [pCutoff] and log2 fold change threshold of [FCcutoff] as visual cut-offs for the y- and x-axis, respectively. Finally, quality control plots of the fitted mean-variance relationship and raw p-values of the features were generated.

The analysis and visualizations described were performed using a publicly available Snakemake (ver) [ref] workflow (ver) [10.5281/zenodo.7808516].

Features

The workflow performs the following steps that produce the outlined results:

  • Differential Expression Analysis (DEA) steps:
    • (optional) calculation of normalization factors using edgeR::calcNormFactors.
    • (optional) calculation of precision weights to model the mean-variance relationship in order to make linear models "applicable" to count data (weighted least squares) using voom.
    • (optional) block on a "group" factor in case you have repeated measurements (generalized least squares).
      • example use-case: you have N donors and T timepoints for each donor and want to model donor specific information like age, but still want to account for the variable donor ie ~ timepoint + age + donor is overdetermined hence the formula becomes ~ timepoint + age and you "block" on donor.
    • fit linear models (ordinary least squares) with the design derived from the configured formula (expects "normal" data) using lmFit.
      • the fitted model object is saved (lmfit_object.rds) for alternative downstream analyses or manual inspection e.g., contrasts.
    • (optional) estimate variance "better" using eBayes, with the robustness flag (robust=TRUE), by looking across all genes (i.e. shrunk towards a common value) and compute moderated t-statistics.
      • (optional) eBayes with limma-trend (trend=TRUE)
    • extract all statistics for variables of interest (=configured comparisons) using topTable (eg coefficients/effect size, statistical significance,...).
    • save a feature list per comparison group and direction of change (up/down) for downstream analyses (eg enrichment analysis) (TXT).
      • (optional) annotated feature list with suffix "_annot" (TXT).
    • (optional) save feature score tables (with two columns: "feature" and "score") per comparison group using score_formula for downstream analyses (eg ranked enrichment analysis) (CSV).
      • (optional) annotated feature scores tables (with two columns: "feature_name" and "score") with suffix "_annot" (CSV).
  • DEA result statistics: total number of statistically significant features and split by positive (up) and negative (down) change (CSV).
  • DEA result filtering of features (eg genes) by
    • statistical significance (<= adjusted p-value: adj_pval)
    • effect size (>= absolute log 2 fold change: lfc)
    • average expression (>= ave_expr) in the data (to skip this filter use -Inf)
  • Visualizations
    • filtered DEA result statistics ie number of features and direction (stacked bar plots)
    • volanco plots per comparison with effect size on the x-axis and raw p-value(rawp)/adjusted p-value (adjp) on the y-axis
      • highlighting features according to configured cut-offs for statistical significance (pCutoff) and effect size (FCcutoff)
      • (optional) highlighting features according to configured feature lists
    • hierarchically clustered heatmap of effect sizes (LFC) per comparison (features x comparisons) indicating statistical significance with a star '*'
      • using all relevant features (FILTERED)
      • (optional) using configured feature lists
      • in case of more than 100 features the row labels and significance indicators (*) are removed
      • in case of more than 50000 features no heatmap is generated
    • diagnostic quality control plots
      • (optional) voom mean-variance trend
      • (optional) intermediate mean-variance trend, in case of blocking and vooming
      • post-fitting mean-variance trend
      • raw p-value distributions (to check for p-value inflation in relation to average expression)

FYI

  • Colons (":") in variable/group names (eg interactions) are replaced with double-underscores ("__") downstream.
  • As of now we do not feature more complex contrast scenarios than are supported via topTable, but the fitted linear model is saved for downstream analyses.

Usage

Here are some tips for the usage of this workflow:

  • limma usage and best practices are not explained. For deatiled documentation, tutorials and insctructions see Resources. To understand the implementation at hand see limma.R.
  • Perform your first run(s) with loose filtering options/cut-offs and use the same for visualization to see if further filtering is even necessary or useful.
  • Test minimal complex configurations on a subset of your data.
  • Start with simple models and try to understand the results.

Configuration

Detailed specifications can be found here ./config/README.md

Examples

--- COMING SOON ---

Links

Resources

Publications

The following publications successfully used this module for their analyses.

  • ...

dea_limma's People

Contributors

lukas-folkman avatar sreichl avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar

dea_limma's Issues

consider capping heatmaps

  • Get input from the team.
  • cap heatmaps like in enrichment analysis e.g.,
    • simple cap ie LFC>cap_th -> set to cap_th
    • log? (log again the LFC?)
    • scale the heatmaps?

Consider: it is a breaking change as it requires a new config field...
same as in dea_seurat: epigen/dea_seurat#15

DEA statistics barplot split of up and down by sign

To get a better overview of the number of differential features in each direction(!), up and down, the barplot could go literally up and down from zero. "down" features need a negative sign (-).

plot_df are the DEA statistics in long format with negative values for "down" values.

# plot config
direction_col <- list("up"="red", "down"="blue")

# plot
ggplot(plot_df, aes(x=groups, y=n_features, fill=direction)) + 
        geom_bar(stat="identity", position="identity") +
        scale_fill_manual(values=direction_col) +
        custom_theme +
        theme(legend.position = "none",
             axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)
             )

consider determining the exact output files from the start for complete compatibility

DEA modules with explicit output to enable usage as module with subsequent modules (avoiding missing input exception) e.g., for enrichment analysis as input.
Requires loading all metadata files and explicitly defining final output files using limma /lmfit variable naming scheme.

Pro:

  • enables smooth usage as a module with explicit outputs to be used for subsequent inputs

Con

  • requires data dependent configuration/annotation -> considered bad practice/to be avoided
    • read up on why and how this translates to this use case

if done, do the same in dea_seurat.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.