
bcbior's Issues

RUVSeq to be optional

Some downstream chunks rely on objects created in the RUVseq chunk. To make the template more flexible and plug-and-play with removable chunks, and since RUVseq probably wouldn't be used in a standard analysis, I would replace those object names with more generic ones; the RUVseq chunk can then simply overwrite the generically named objects if people want to include it.
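A minimal sketch of the generic-naming idea (the names dds, dds_final, empirical_controls and condition are placeholders, not the template's actual objects): downstream code only ever references the generic object, and the optional RUVseq chunk simply overwrites it.

# downstream chunks only ever reference dds_final
dds_final <- dds

# --- optional RUVseq chunk: remove freely, or uncomment to include ----------
# set <- RUVSeq::RUVg(counts(dds), cIdx = empirical_controls, k = 1)
# colData(dds)$W_1 <- set$W[, 1]
# design(dds) <- ~ W_1 + condition
# dds_final <- dds
# -----------------------------------------------------------------------------

dds_final <- DESeq(dds_final)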

DE - coefficient vs contrasts

When extracting results, it is best to use coefficients when you can: extracting results with coefficients vs. contrasts produces different results, and the apeglm shrinkage method is incompatible with contrasts.
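For concreteness, a hedged DESeq2 example of the two extraction styles (the design and level names here are assumptions, not taken from the template):

# assumes design(dds) is ~ condition with reference level "control" and a "treated" level
resultsNames(dds)                     # e.g. "condition_treated_vs_control"

# coefficient-based extraction (required for apeglm shrinkage)
res_coef <- lfcShrink(dds, coef = "condition_treated_vs_control", type = "apeglm")

# contrast-based extraction (apeglm not supported; ashr works with contrasts)
res_contrast <- lfcShrink(dds, contrast = c("condition", "treated", "control"), type = "ashr")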

check samplesheet function to ensure files exist

It would be nice if the check function for nf-core rnaseq samplesheets actually looked at the fastq file paths and verified that they exist. Not sure how feasible this is for S3 paths, though.
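A hedged sketch of what such a check could look like (the fastq_1/fastq_2 column names follow the nf-core rnaseq samplesheet; skipping S3 paths with a warning is just one possible approach):

check_fastq_paths <- function(samplesheet_fn) {
  ss <- readr::read_csv(samplesheet_fn)
  paths <- na.omit(c(ss$fastq_1, ss$fastq_2))
  is_s3 <- grepl("^s3://", paths)
  # local files can be checked directly
  missing_local <- paths[!is_s3][!file.exists(paths[!is_s3])]
  if (any(is_s3))
    warning(sum(is_s3), " S3 path(s) not checked (would need credentials, e.g. aws.s3::object_exists())")
  if (length(missing_local) > 0)
    stop("fastq files not found:\n", paste(missing_local, collapse = "\n"))
  invisible(TRUE)
}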

add checks for low count genes?

High dropout rates or generally low counts can linger in the data despite DESeq2 filtering. Add a display of zero counts and optional filtering for this?
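A rough sketch of what the display and optional filter could look like, assuming a raw counts matrix named counts (as produced elsewhere in the template) and the tidyverse loaded; the thresholds are placeholders:

# how many genes are entirely zero, and how many zero-count genes per sample?
table(rowSums(counts) == 0)
colSums(counts == 0) %>%
  enframe("sample", "n_zero_genes") %>%
  sanitize_datatable()

# optional filter: keep genes with >= 10 counts in at least 3 samples
keep <- rowSums(counts >= 10) >= 3
counts_filtered <- counts[keep, ]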

Feedback in development branch

Notes about bcbioR process
https://github.com/bcbio/bcbioR
  • “Set base project” and “Set RNAseq report folder” code do the same thing in terms of setting up the file structure and Rmd, which is confusing
  • Link to the vignette is missing a “ and therefore does not work
  • Steps followed (based on the README):
    ○ Skipped everything before “Downstream analysis” because Emma already did this
    ○ Modify “information.R”
    ○ Modify “QC/QC_nf-core.Rmd” and “params_qc_nf-core.R” because we are running the nf-core output template
    ○ Make sure all the libraries listed in “QC_nf-core.Rmd” are installed
    ○ Knit “QC_nf-core.Rmd”
    ○ Create additional plots / colors / descriptions in the QC report as needed
  • In QC_nf-core.Rmd:
    ○ The “load_metadata” chunk removes the fastq and strandedness columns twice
    ○ Some typos:
      § In the “source_params” chunk, “manually” is misspelled
      § In the “prepare metrics” chunk, “General Table of MultiQC” is misspelled
    ○ In the “Read Metrics” section, all plots have “size=4” in the geom_point call except for “tRNA/rRNA mapping rate”
    ○ In the “PCA” section, the code indicates that PCs 1-5 should be assessed, but only looks at 1-4
      § Also, it would be nice to give examples/options here for adding additional coloring/shapes to PCA plots (other metadata besides the factor of interest)
    ○ Colors are not consistent between the heat map and the other figures (everything else is yellow/blue, the heat map is grey/black)
    ○ In the “plot_genes_detected” chunk, the title of the second plot is “Total reads” rather than “Number of genes”

introduce changes from Emma

source(params$params_file)
source(params$project_file)
library(tidyverse)
library(knitr)
library(DESeq2)
library(DEGreport)
library(ggrepel)
library(pheatmap)
# library(RColorBrewer)
library(DT)
library(bcbioR)
library(janitor)
ggplot2::theme_set(theme_light(base_size = 14))
opts_chunk[["set"]](
    cache = FALSE,
    cache.lazy = FALSE,
    dev = c("png", "pdf"),
    error = TRUE,
    highlight = TRUE,
    message = FALSE,
    prompt = FALSE,
    tidy = FALSE,
    warning = FALSE,
    fig.height = 4)
#' Create sub-chunks for plots
#'
#' taken from: https://stackoverflow.com/questions/15365829/dynamic-height-and-width-for-knitr-plots
#'
#' @param pl a plot object
#' @param fig.height figure height
#' @param fig.width figure width
#' @param chunk_name name of the chunk
#'
#' @author Andreas Scharmueller \email{andschar@@protonmail.com}
#'
subchunkify = function(pl,
                       fig.height = 7,
                       fig.width = 5,
                       chunk_name = 'plot') {
  pl_deparsed = paste0(deparse(function() {
    pl
  }), collapse = '')
  
  sub_chunk = paste0(
    "```{r ",
    chunk_name,
    ", fig.height=",
    fig.height,
    ", fig.width=",
    fig.width,
    ", dpi=72",
    ", echo=FALSE, message=FALSE, warning=FALSE, fig.align='center'}",
    "\n(",
    pl_deparsed,
    ")()",
    "\n```"
  )
  
  cat(knitr::knit(
    text = knitr::knit_expand(text = sub_chunk),
    quiet = TRUE
  ))
}

sanitize_datatable = function(df, ...) {
 # remove dashes which cause wrapping
 DT::datatable(df, ..., rownames=gsub("-", "_", rownames(df)),
                   colnames=gsub("-", "_", colnames(df)))
}

Overview

  • Project: r project
  • PI: r PI
  • Analyst: r analyst
  • Experiment: r experiment
  • Aim: r aim

Samples and metadata

###Main factor of interest###
# This will be used to order the samples and annotate the QC figures
FOI <- "tissue_type"

## May 8th, I edited the metadata read in to take the input csv from nf-core which MAY have duplicated lines in the case of dup lanes##
meta_df=read_csv(metadata_fn)  %>% arrange(.data[[FOI]]) %>% distinct(sample, .keep_all = T)

order <- meta_df$sample


ggplot(meta_df, aes(.data[[FOI]], fill = .data[[FOI]])) +
  geom_bar() + ylab("") + xlab("") +
  scale_fill_cb_friendly()

metrics <- read_tsv(paste0(multiqc_data_dir, 'multiqc_general_stats.txt'))
metrics_qualimap <- read_tsv(paste0(multiqc_data_dir, 'mqc_qualimap_genomic_origin_1.txt'))

metrics <- metrics %>% full_join(metrics_qualimap)
metrics <- metrics %>%
  clean_names() %>% 
  dplyr::rename_with(~gsub('.*mqc_generalstats_', '', .))

total_reads <- metrics %>% 
  dplyr::filter(grepl('_', sample)) %>% 
  remove_empty(which = 'cols') %>%
  dplyr::rename(single_sample = sample) %>%
  mutate(sample = gsub('_[12]+$', '', single_sample)) %>%
  group_by(sample) %>%
  summarize(total_reads = sum(fastqc_raw_total_sequences))

metrics <- metrics %>% 
  dplyr::filter(!grepl('_', sample)) %>% 
  remove_empty(which = 'cols') %>%
  full_join(total_reads) %>%
  mutate(mapped_reads = samtools_reads_mapped) %>%
  mutate(exonic_rate = exonic/(star_uniquely_mapped * 2)) %>%
  mutate(intronic_rate = intronic/(star_uniquely_mapped * 2)) %>%
  mutate(intergenic_rate = intergenic/(star_uniquely_mapped * 2)) %>%
 # mutate(r_rna_rate = custom_content_biotype_counts_percent_r_rna) %>%
  mutate(x5_3_bias = qualimap_5_3_bias)

metrics <- metrics %>%
    full_join(meta_df , by = c("sample" = "sample"))


### Just for this analysis, not to be permanently added#####
metrics$sample <- gsub("X4","4", metrics$sample)
meta_sm <- meta_df %>%
  as.data.frame() %>%
  column_to_rownames("sample")

meta_sm %>% sanitize_datatable()

Read metrics {.tabset}

Total reads

Here, we want to see consistency and a minimum of 20 million reads.

metrics %>%
    ggplot(aes(x = factor(sample, level = order),
               y = total_reads / 1e6, # convert to millions to match the axis label
               fill = .data[[FOI]])) +
    geom_bar(stat = "identity") +
    coord_flip() +
    scale_y_continuous(name = "million reads") +
    scale_fill_cb_friendly() + xlab("") + 
    ggtitle("Total reads") 

#get min percent mapped reads for reference
min_pct_mapped <- round(min(metrics$mapped_reads/metrics$total_reads)*100,1)
max_pct_mapped <- round(max(metrics$mapped_reads/metrics$total_reads)*100,1)

Mapping rate

The genomic mapping rate represents the percentage of reads mapping to the reference genome. We want to see consistent mapping rates between samples and over 70% mapping. These samples have mapping rates between r min_pct_mapped% and r max_pct_mapped%.

metrics$mapped_reads_pct <- round(metrics$mapped_reads/metrics$total_reads*100,1)
metrics %>%
    ggplot(aes(x = factor(sample, level = order), 
               y = mapped_reads_pct, 
               color = .data[[FOI]])) +
        geom_point() +
    coord_flip() +
    scale_color_cb_friendly() +
    ylim(0, 100) +
    ggtitle("Mapping rate") +
  geom_hline(yintercept=70, color = cb_friendly_cols('blue'))

Number of genes detected

The number of genes represented in every sample is expected to be consistent and over 20K (blue line).

se <- readRDS(se_object)
genes_detected <- colSums(assays(se)[["counts"]] > 0) %>% enframe()
sample_names <- metrics[,c("sample"), drop=F]
genes_detected <- left_join(genes_detected, sample_names, by = c("name" = "sample"))
genes_detected <- genes_detected %>% group_by(name)
genes_detected <- summarise(genes_detected, 
                             n_genes = max(value))

#### Code to fix stupid names, only for this analysis. Remove for git###
genes_detected$name <- gsub("X4","4", genes_detected$name)

                      
metrics <- metrics %>%
  left_join(genes_detected, by = c("sample" = "name"))

ggplot(metrics,aes(x = factor(sample, level = order),
           y = n_genes, fill = .data[[FOI]])) +
  geom_bar(stat = "identity") +
  coord_flip() +
  scale_fill_cb_friendly() +
  ggtitle("Number of genes") +
  ylab("Number of genes") +
  xlab("") +
  geom_hline(yintercept=20000, color = cb_friendly_cols('blue'))

Gene detection saturation

This plot shows how complex the samples are. We expect samples with more reads to detect more genes.

metrics %>% 
  ggplot(aes(x = total_reads, 
             y = n_genes,
             color = .data[[FOI]])) +
  geom_point()+
  scale_x_log10() +
  scale_color_cb_friendly() +
  ggtitle("Gene saturation") +
  ylab("Number of genes")

Exonic mapping rate

Here we are looking for consistency, and exonic mapping rates around 70% or 75% (blue and brown lines, respectively).

metrics %>%
  ggplot(aes(x = factor(sample, level = order),
             y = exonic_rate * 100, 
             color = .data[[FOI]])) +
  geom_point() +
  ylab("Exonic rate %") + 
  ggtitle("Exonic mapping rate") + 
  scale_color_cb_friendly() +
  coord_flip()  +
  xlab("") +
  ylim(c(0,100)) +
  geom_hline(yintercept=70, color = cb_friendly_cols('blue')) +
  geom_hline(yintercept=75, color = cb_friendly_cols('brown'))

Intronic mapping rate

Here, we expect a low intronic mapping rate (≤ 15-20%; brown and blue lines, respectively).

metrics %>%
    ggplot(aes(x = factor(sample, level = order),
               y = intronic_rate * 100, 
               color = .data[[FOI]])) +
  geom_point() +
  ylab("Intronic rate %") +
  ggtitle("Intronic mapping rate") + 
  scale_color_cb_friendly() +
  coord_flip()  +
  xlab("") +
  ylim(c(0,100)) +
  geom_hline(yintercept=20, color = cb_friendly_cols('blue')) +
  geom_hline(yintercept=15, color = cb_friendly_cols('brown'))

Intergenic mapping rate

Here, we expect a low intergenic mapping rate, which is true for all samples.

metrics %>%
    ggplot(aes(x = factor(sample, level = order),
               y = intergenic_rate * 100, 
               color = .data[[FOI]])) +
        geom_point() +
    ylab("Intergenic rate %") +
    ggtitle("Intergenic mapping rate") + 
    coord_flip()  + xlab("") +
    scale_color_cb_friendly() +
    ylim(c(0, 100))

rRNA mapping rate

Samples should have a ribosomal RNA (rRNA) "contamination" rate below 10%.

# for some bad samples it could be > 50%
rrna_ylim <- max(round(metrics$r_rna_rate*100, 2)) + 10
metrics %>%
    ggplot(aes(x = factor(sample, level = order),
               y = r_rna_rate * 100, 
               color = .data[[FOI]])) +
        geom_point() +
    ylab("rRNA rate, %")+
    ylim(0, rrna_ylim) + 
    ggtitle("rRNA mapping rate") +
    coord_flip() +
    scale_color_cb_friendly()

5'->3' bias

There should be little bias, i.e. the values should be close to 1, or at least consistent among samples.

metrics %>%
  ggplot(aes(x = factor(sample, level = order),
             y = x5_3_bias, 
             color = .data[[FOI]])) +
  geom_point() +
  ggtitle("5'-3' bias") + 
  coord_flip() +
  ylim(c(0.5,1.5)) + xlab("") + ylab("5'-3' bias") +
  scale_color_cb_friendly()+
  geom_hline(yintercept=1, color = cb_friendly_cols('blue'))

Counts per gene - all genes

We expect consistency in the box plots between samples, i.e. the distribution of counts across genes should be similar.

metrics_small <- metrics %>% dplyr::select(sample, .data[[FOI]])
metrics_small <- left_join(sample_names, metrics_small)

counts <- 
  assays(se)[["counts"]] %>% 
  as_tibble() %>%
  filter(rowSums(.)!=0) %>% 
  gather(name, counts) 

### Just for this analysis, not to be permanently added#####
counts$name <- gsub("X4","4", counts$name)


counts <- left_join(counts, metrics_small, by = c("name" = "sample"))

ggplot(counts, aes(factor(name, level = order),
                   log2(counts+1),
                   fill = .data[[FOI]])) +
  geom_boxplot() + 
  scale_fill_cb_friendly() +
  coord_flip() + xlab("") +
  ggtitle("Counts per gene, all non-zero genes") +
  scale_color_cb_friendly()

Sample similarity analysis

In this section, we look at how well the different groups in the dataset cluster with each other. Samples from the same group should ideally cluster together. We use Principal Component Analysis (PCA).

Principal component analysis (PCA) {.tabset}

Principal Component Analysis (PCA) is a statistical technique used to simplify high-dimensional data by identifying patterns and reducing the number of variables. In the context of gene expression, PCA helps analyze large datasets containing information about the expression levels of thousands of genes across different samples (e.g., tissues, cells).

raw_counts <- assays(se)[["counts"]] %>% round() %>%
    as_tibble() %>%
    filter(rowSums(.)!=0) %>%
    as.matrix()

### Just for this analysis, not to be permanently added#####
colnames(raw_counts) <- gsub("X4","4", colnames(raw_counts))



vst <- vst(raw_counts)

#fix samples names
coldat_for_pca <- as.data.frame(metrics)
rownames(coldat_for_pca) <- coldat_for_pca$sample
coldat_for_pca <- coldat_for_pca[colnames(raw_counts),]
pca1 <- degPCA(vst, coldat_for_pca,
              condition = FOI, data = T)[["plot"]]
pca2 <- degPCA(vst, coldat_for_pca,
              condition = FOI, data = T, pc1="PC3", pc2="PC4")[["plot"]]

pca1 + scale_color_cb_friendly()
pca2 + scale_color_cb_friendly()
## Remove non-useful columns output by nf-core
coldat_2 <- data.frame(coldat_for_pca[,!(colnames(coldat_for_pca) %in% c("fastq_1", "fastq_2", "salmon_library_types", "salmon_compatible_fragment_ratio", "samtools_reads_mapped_percent", "samtools_reads_properly_paired_percent", "samtools_mapped_passed_pct", "strandedness"))])

# Remove missing data
coldat_2 <- na.omit(coldat_2)
degCovariates(vst, metadata = coldat_2)
## Hierarchical clustering

vst_cor <- cor(vst)

annotation_cols <- cb_friendly_pal('grey')(length(unique(coldat_for_pca[[FOI]])))
names(annotation_cols) <- unique(coldat_for_pca[[FOI]])

## Note am still unable to get the annotation colors with the CB palette to work##
pheatmap(vst_cor, annotation_col = coldat_for_pca[,FOI, drop=F], show_rownames = T, show_colnames = T, color = cb_friendly_pal('heatmap')(15))
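## Hedged sketch (untested here): pheatmap expects `annotation_colors` to be a
## named list keyed by the annotation column name, each element a color vector
## named by the levels, so something like this may get the CB palette applied:
# annotation_colors <- setNames(list(annotation_cols), FOI)
# pheatmap(vst_cor, annotation_col = coldat_for_pca[, FOI, drop = F],
#          annotation_colors = annotation_colors,
#          show_rownames = T, show_colnames = T,
#          color = cb_friendly_pal('heatmap')(15))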

R session

List and versions of the tools used to generate the QC report.

sessionInfo()

QC metrics: make the default point/label size larger

Accessibility: the colored dots in the QC metrics plots are too small (for me, anyway) to easily see what color they are, especially over Zoom meetings. Individuals can of course change this themselves, but a larger default size might improve readability. Personally, I use barplots, whose color is hard to miss.
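For example, in the template's mapping-rate plot the change would be a single argument (a sketch reusing objects already defined in the QC report):

ggplot(metrics, aes(x = factor(sample, level = order), y = mapped_reads_pct,
                    color = .data[[FOI]])) +
  geom_point(size = 4)   # vs the ggplot2 default of ~1.5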

cutandrun report elements

Present in nf-core multiqc report:

  • fastqc metrics, including GC content
  • mapping percentages (bowtie2)
  • reads passing filters (cutadapt)
  • trimmed sequence lengths (cutadapt)
  • unique/multimappers (bowtie2)
  • alignment metrics and stats (samtools)
  • duplication stats (picard markDuplicates)
  • all-sample correlation heatmap (deeptools)
  • PCA (deeptools)
  • fingerprint plot (deeptools)
  • peak counts, individual and consensus (macs2/seacr)

Present in Upen's cutandrun QC report:

  • total reads
  • mapping rate
  • mapped reads
  • PBC1
  • PBC2
  • non-redundant fraction
  • GC percentage
  • peak counts
  • fraction of reads in peaks
  • all-sample correlation heatmap
  • PCA plots 1) faceted by antibody 2) all samples 3) zoomed in

Desired in bcbioR template:

  • total reads
  • mapping rate
  • mapped reads
  • peak counts
  • ?

Options for batch effect in data

RUVseq vs. SVA/ComBat-seq vs. other: if we are going to have a standardized option for finding and removing sources of unknown variability, we may want to first determine which of these methods is best.
○ Along those lines, there are QC steps that can/should be done, and should be included, for determining whether an unknown source of variation corresponds to known sources of variation in the data; this is important for avoiding overfitting the model if confounding variables (known or not) are included.
§ For example, correlating SVs (or the RUVSeq equivalent) with principal components and aspects of the metadata, as in the sketch below.
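A minimal sketch of that kind of check using svaseq (assumes a DESeqDataSet dds that has already been run through DESeq(), a condition factor of interest, and a hypothetical batch column in colData):

library(sva)
library(DESeq2)

coldat <- as.data.frame(colData(dds))
dat  <- counts(dds, normalized = TRUE)
dat  <- dat[rowMeans(dat) > 1, ]
mod  <- model.matrix(~ condition, coldat)
mod0 <- model.matrix(~ 1, coldat)
svseq <- svaseq(dat, mod, mod0, n.sv = 2)

# does SV1 track a known covariate (here, the hypothetical batch column)?
summary(lm(svseq$sv[, 1] ~ coldat$batch))

# does SV1 track the leading principal components of the VST counts?
pca <- prcomp(t(assay(vst(dds))))
cor(svseq$sv[, 1], pca$x[, 1:3])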

metrics excluded from nfcore

Present in the bcbio metrics df but currently absent from the nf-core metrics df because none of them are used for plots in QC.rmd; this affects the degCovariates plot in DE.rmd:

  • duplicates_pct
  • duplication_rate_of_mapped
  • r_rna (not r_rna_rate, that is in both already)
  • percent gc
  • mapped paired reads (not mapped reads, that is in both already)
  • average insert size

PCA

I have something that might be useful to include as an option for PCA: I always look at PCA labeled by all metadata and metrics before narrowing down/finalizing what I include in the report for the client. It's not always informative, but sometimes it surfaces an unexpected confounding variable in the data that's worth reporting back and/or considering in the analysis going forward. I wrote a function to automatically generate plots for a user-provided list of metadata/metrics over any range of principal components, provided that the metadata and metrics are joined together. I can provide this function if we think it would be useful to include.
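This is not that function, but a hypothetical stand-in to illustrate the idea (assumes a variance-stabilized matrix vst_mat and a data.frame coldata of joined metadata/metrics with one row per sample, in the same order as the columns of vst_mat):

library(ggplot2)

# one PC-pair scatter plot colored by one covariate
pca_covariate_plot <- function(df, xvar, yvar, covar, pct) {
  ggplot(df, aes(.data[[xvar]], .data[[yvar]], color = .data[[covar]])) +
    geom_point(size = 3) +
    labs(x = paste0(xvar, " (", pct[[xvar]], "%)"),
         y = paste0(yvar, " (", pct[[yvar]], "%)"),
         title = covar)
}

plot_pca_by_covariate <- function(vst_mat, coldata,
                                  covariates = colnames(coldata), pcs = 1:4) {
  pca <- prcomp(t(vst_mat))
  pct <- setNames(round(100 * pca$sdev^2 / sum(pca$sdev^2), 1), colnames(pca$x))
  df  <- cbind(as.data.frame(pca$x[, pcs, drop = FALSE]), coldata)
  plots <- list()
  for (covar in covariates) {
    for (p in combn(pcs, 2, simplify = FALSE)) {
      xvar <- paste0("PC", p[1]); yvar <- paste0("PC", p[2])
      plots[[paste(covar, xvar, yvar, sep = "_")]] <-
        pca_covariate_plot(df, xvar, yvar, covar, pct)
    }
  }
  plots
}

# e.g.: pca_plots <- plot_pca_by_covariate(vst, coldat_for_pca,
#                                          covariates = c("tissue_type", "total_reads"))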

DEG rdata annotation assumes human reference

DEG template, load_counts_data chunk to create rdata object from AnnotationDbi

rdata = AnnotationDbi::select(org.Hs.eg.db, rownames(counts), 'SYMBOL', 'ENSEMBL') %>% dplyr::select(gene_id = ENSEMBL, gene_name = SYMBOL)
Assumes human reference

The reference used should be based on the genome set in the parameters at the top of the file (see the sketch after the mouse example below).

If using mouse reference, need the following:

  1. In libraries:
    # BiocManager::install("org.Mm.eg.db")
    library(org.Mm.eg.db)

  2. In load_counts_data chunk to create rdata object
    rdata = AnnotationDbi::select(org.Mm.eg.db, rownames(counts), 'SYMBOL', 'ENSEMBL') %>% dplyr::select(gene_id = ENSEMBL, gene_name = SYMBOL)
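A hedged sketch of basing this on a genome parameter (the genome variable and the values it can take are assumptions about what the params file could expose):

org_db <- switch(genome,
                 "hg38"   = ,
                 "GRCh38" = org.Hs.eg.db::org.Hs.eg.db,
                 "mm10"   = ,
                 "GRCm38" = ,
                 "GRCm39" = org.Mm.eg.db::org.Mm.eg.db,
                 stop("No OrgDb configured for genome: ", genome))

rdata = AnnotationDbi::select(org_db, rownames(counts), 'SYMBOL', 'ENSEMBL') %>%
  dplyr::select(gene_id = ENSEMBL, gene_name = SYMBOL)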

Viridis for Heatmap

Viridis would be both color-blind friendly and more aesthetically pleasing/easier to parse for people without color blindness.
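For example, swapping the color argument in the existing hierarchical-clustering heatmap call (a sketch, assuming the viridis package is installed):

pheatmap(vst_cor, annotation_col = coldat_for_pca[, FOI, drop = F],
         show_rownames = T, show_colnames = T,
         color = viridis::viridis(15))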

DE_template_example.Rmd - subset discussion

DE_template_example.Rmd
• Consider changing the template to not subset the data by default:
○ DESeq2 best practice is to avoid subsetting the data: "Typically, we recommend users to run samples from all groups together, and then use the contrast argument of the results function to extract comparisons of interest after fitting the model using DESeq."
○ Source: https://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#if-i-have-multiple-groups-should-i-run-all-together-or-split-into-pairs-of-groups
§ One thing to note, however, is that apeglm shrinkage is incompatible with contrasts, so depending on the individual study and which comparisons need to be made, the shrinkage method may need to be swapped for ashr, which is compatible with contrasts (see the sketch below).
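A short hedged example of that swap (factor and level names are assumptions): fit all groups together, then pull each pairwise comparison out with a contrast and shrink with ashr.

dds <- DESeq(dds)   # all groups together, no subsetting
res_b_vs_a <- lfcShrink(dds, contrast = c("condition", "groupB", "groupA"), type = "ashr")
res_c_vs_a <- lfcShrink(dds, contrast = c("condition", "groupC", "groupA"), type = "ashr")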

Enrichment analysis flexibility

Over-representation analysis options should ideally include splitting the analysis by up-regulated and down-regulated significantly differentially expressed genes; it's usually not as informative to use all significant genes together (a sketch of the split follows below).
• More options for over-representation analysis (clusterProfiler, enrichGO) might be nice
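A hedged sketch of the up/down split with clusterProfiler::enrichGO (the results data frame res_df and its column names are assumptions, and the human OrgDb would be swapped per the genome discussion above):

library(clusterProfiler)
library(org.Hs.eg.db)

sig  <- subset(res_df, padj < 0.05)
up   <- sig$gene_id[sig$log2FoldChange > 0]
down <- sig$gene_id[sig$log2FoldChange < 0]

ora_up   <- enrichGO(up,   OrgDb = org.Hs.eg.db, keyType = "ENSEMBL", ont = "BP",
                     universe = res_df$gene_id)
ora_down <- enrichGO(down, OrgDb = org.Hs.eg.db, keyType = "ENSEMBL", ont = "BP",
                     universe = res_df$gene_id)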

Adapt DE report to nf-core

Add variables to params_de.R for the nf-core files.

Create an R function file that would be loaded by DEG.Rmd, with a function that creates the count and metrics variables used downstream by the DEG.Rmd code (rough sketch below).
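A rough sketch of such a function (file names, column names, and the params$ entries are assumptions based on typical nf-core/rnaseq outputs and would be defined in params_de.R):

load_nfcore_de_inputs <- function(counts_fn, multiqc_fn, metadata_fn) {
  # gene counts, e.g. star_salmon/salmon.merged.gene_counts.tsv
  counts <- readr::read_tsv(counts_fn) %>%
    dplyr::select(-dplyr::any_of("gene_name")) %>%
    tibble::column_to_rownames("gene_id") %>%
    round()
  # QC metrics, e.g. multiqc_data/multiqc_general_stats.txt
  metrics <- readr::read_tsv(multiqc_fn) %>% janitor::clean_names()
  # nf-core input samplesheet, de-duplicated across lanes
  coldata <- readr::read_csv(metadata_fn) %>%
    dplyr::distinct(sample, .keep_all = TRUE)
  list(counts = counts, metrics = metrics, coldata = coldata)
}

# in DEG.Rmd:
# inputs  <- load_nfcore_de_inputs(params$counts_fn, params$multiqc_fn, params$metadata_fn)
# counts  <- inputs$counts
# metrics <- inputs$metrics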
