
bcbior's Issues

RUVSeq to be optional

Some downstream chunks rely on objects created in the RUVseq chunk. To make the template more flexible and plug-and-play with removable chunks, and since RUVseq probably wouldn't be used in a standard analysis, I would replace those object names with more generic ones; the RUVseq chunk can then simply overwrite the generically named objects if people want to include it.
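A minimal sketch of the generic-naming idea (the names dds, dds_final, empirical_controls and condition are placeholders, not the template's actual objects): downstream code only ever references the generic object, and the optional RUVseq chunk simply overwrites it.

# downstream chunks only ever reference dds_final
dds_final <- dds

# --- optional RUVseq chunk: remove freely, or uncomment to include ----------
# set <- RUVSeq::RUVg(counts(dds), cIdx = empirical_controls, k = 1)
# colData(dds)$W_1 <- set$W[, 1]
# design(dds) <- ~ W_1 + condition
# dds_final <- dds
# -----------------------------------------------------------------------------

dds_final <- DESeq(dds_final)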

DE - coefficient vs contrasts

When extracting results, it is best to use coefficients when you can: extracting results with coefficients vs. contrasts produces different results, and the apeglm shrinkage method is incompatible with contrasts.
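For concreteness, a hedged DESeq2 example of the two extraction styles (the design and level names here are assumptions, not taken from the template):

# assumes design(dds) is ~ condition with reference level "control" and a "treated" level
resultsNames(dds)                     # e.g. "condition_treated_vs_control"

# coefficient-based extraction (required for apeglm shrinkage)
res_coef <- lfcShrink(dds, coef = "condition_treated_vs_control", type = "apeglm")

# contrast-based extraction (apeglm not supported; ashr works with contrasts)
res_contrast <- lfcShrink(dds, contrast = c("condition", "treated", "control"), type = "ashr")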

check samplesheet function to ensure files exist

It would be nice if the check function for nf-core rnaseq samplesheets actually looked at the fastq file paths and verified that they exist. Not sure how feasible this is for S3 paths, though.
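A hedged sketch of what such a check could look like (the fastq_1/fastq_2 column names follow the nf-core rnaseq samplesheet; skipping S3 paths with a warning is just one possible approach):

check_fastq_paths <- function(samplesheet_fn) {
  ss <- readr::read_csv(samplesheet_fn)
  paths <- na.omit(c(ss$fastq_1, ss$fastq_2))
  is_s3 <- grepl("^s3://", paths)
  # local files can be checked directly
  missing_local <- paths[!is_s3][!file.exists(paths[!is_s3])]
  if (any(is_s3))
    warning(sum(is_s3), " S3 path(s) not checked (would need credentials, e.g. aws.s3::object_exists())")
  if (length(missing_local) > 0)
    stop("fastq files not found:\n", paste(missing_local, collapse = "\n"))
  invisible(TRUE)
}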

add checks for low count genes?

High dropout rates or generally low counts can linger in the data despite DESeq2 filtering. Add a display of zero counts and optional filtering for this?
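A rough sketch of what the display and optional filter could look like, assuming a raw counts matrix named counts (as produced elsewhere in the template) and the tidyverse loaded; the thresholds are placeholders:

# how many genes are entirely zero, and how many zero-count genes per sample?
table(rowSums(counts) == 0)
colSums(counts == 0) %>%
  enframe("sample", "n_zero_genes") %>%
  sanitize_datatable()

# optional filter: keep genes with >= 10 counts in at least 3 samples
keep <- rowSums(counts >= 10) >= 3
counts_filtered <- counts[keep, ]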

Feedback in development branch

Notes about bcbioR process
https://github.com/bcbio/bcbioR
  • “Set base project” and “Set RNAseq report folder” code do the same thing in terms of setting up the file structure and Rmd, which is confusing
  • Link to the vignette is missing a “ and therefore does not work
  • Steps followed (based on the README):
    ○ Skipped everything before “Downstream analysis” because Emma already did this
    ○ Modify “information.R”
    ○ Modify “QC/QC_nf-core.Rmd” and “params_qc_nf-core.R” because we are running the nf-core output template
    ○ Make sure all the libraries listed in “QC_nf-core.Rmd” are installed
    ○ Knit “QC_nf-core.Rmd”
    ○ Create additional plots / colors / descriptions in the QC report as needed
  • In QC_nf-core.Rmd:
    ○ The “load_metadata” chunk removes the fastq and strandedness columns twice
    ○ Some typos:
      § In the “source_params” chunk, “manually” is misspelled
      § In the “prepare metrics” chunk, “General Table of MultiQC” is misspelled
    ○ In the “Read Metrics” section, all plots have “size=4” in the geom_point call except for “tRNA/rRNA mapping rate”
    ○ In the “PCA” section, the code indicates that PCs 1-5 should be assessed, but only looks at 1-4
      § Also, it would be nice to give examples/options here for adding additional coloring/shapes to PCA plots (other metadata besides the factor of interest)
    ○ Colors are not consistent between the heat map and the other figures (everything else is yellow/blue, the heat map is grey/black)
    ○ In the “plot_genes_detected” chunk, the title of the second plot is “Total reads” rather than “Number of genes”

introduce changes from Emma

source(params$params_file)
source(params$project_file)
library(tidyverse)
library(knitr)
library(DESeq2)
library(DEGreport)
library(ggrepel)
library(pheatmap)
# library(RColorBrewer)
library(DT)
library(bcbioR)
library(janitor)
ggplot2::theme_set(theme_light(base_size = 14))
opts_chunk[["set"]](
    cache = FALSE,
    cache.lazy = FALSE,
    dev = c("png", "pdf"),
    error = TRUE,
    highlight = TRUE,
    message = FALSE,
    prompt = FALSE,
    tidy = FALSE,
    warning = FALSE,
    fig.height = 4)
#' Create sub-chunks for plots
#'
#' taken from: https://stackoverflow.com/questions/15365829/dynamic-height-and-width-for-knitr-plots
#'
#' @param pl a plot object
#' @param fig.height figure height
#' @param fig.width figure width
#' @param chunk_name name of the chunk
#'
#' @author Andreas Scharmueller \email{andschar@@protonmail.com}
#'
subchunkify = function(pl,
                       fig.height = 7,
                       fig.width = 5,
                       chunk_name = 'plot') {
  pl_deparsed = paste0(deparse(function() {
    pl
  }), collapse = '')
  
  sub_chunk = paste0(
    "```{r ",
    chunk_name,
    ", fig.height=",
    fig.height,
    ", fig.width=",
    fig.width,
    ", dpi=72",
    ", echo=FALSE, message=FALSE, warning=FALSE, fig.align='center'}",
    "\n(",
    pl_deparsed,
    ")()",
    "\n```"
  )
  
  cat(knitr::knit(
    text = knitr::knit_expand(text = sub_chunk),
    quiet = TRUE
  ))
}

sanitize_datatable = function(df, ...) {
 # remove dashes which cause wrapping
 DT::datatable(df, ..., rownames=gsub("-", "_", rownames(df)),
                   colnames=gsub("-", "_", colnames(df)))
}

Overview

  • Project: r project
  • PI: r PI
  • Analyst: r analyst
  • Experiment: r experiment
  • Aim: r aim

Samples and metadata

###Main factor of interest###
# This will be used to order the samples and annotate the QC figures
FOI <- "tissue_type"

## May 8th, I edited the metadata read in to take the input csv from nf-core which MAY have duplicated lines in the case of dup lanes##
meta_df=read_csv(metadata_fn)  %>% arrange(.data[[FOI]]) %>% distinct(sample, .keep_all = T)

order <- meta_df$sample


ggplot(meta_df, aes(.data[[FOI]], fill = .data[[FOI]])) +
  geom_bar() + ylab("") + xlab("") +
  scale_fill_cb_friendly()

metrics <- read_tsv(paste0(multiqc_data_dir, 'multiqc_general_stats.txt'))
metrics_qualimap <- read_tsv(paste0(multiqc_data_dir, 'mqc_qualimap_genomic_origin_1.txt'))

metrics <- metrics %>% full_join(metrics_qualimap)
metrics <- metrics %>%
  clean_names() %>% 
  dplyr::rename_with(~gsub('.*mqc_generalstats_', '', .))

total_reads <- metrics %>% 
  dplyr::filter(grepl('_', sample)) %>% 
  remove_empty(which = 'cols') %>%
  dplyr::rename(single_sample = sample) %>%
  mutate(sample = gsub('_[12]+$', '', single_sample)) %>%
  group_by(sample) %>%
  summarize(total_reads = sum(fastqc_raw_total_sequences))

metrics <- metrics %>% 
  dplyr::filter(!grepl('_', sample)) %>% 
  remove_empty(which = 'cols') %>%
  full_join(total_reads) %>%
  mutate(mapped_reads = samtools_reads_mapped) %>%
  mutate(exonic_rate = exonic/(star_uniquely_mapped * 2)) %>%
  mutate(intronic_rate = intronic/(star_uniquely_mapped * 2)) %>%
  mutate(intergenic_rate = intergenic/(star_uniquely_mapped * 2)) %>%
 # mutate(r_rna_rate = custom_content_biotype_counts_percent_r_rna) %>%
  mutate(x5_3_bias = qualimap_5_3_bias)

metrics <- metrics %>%
    full_join(meta_df , by = c("sample" = "sample"))


### Just for this analysis, not to be permanently added#####
metrics$sample <- gsub("X4","4", metrics$sample)
meta_sm <- meta_df %>%
  as.data.frame() %>%
  column_to_rownames("sample")

meta_sm %>% sanitize_datatable()

Read metrics {.tabset}

Total reads

Here, we want to see consistency and a minimum of 20 million reads.

metrics %>%
    ggplot(aes(x = factor(sample, level = order),
               y = total_reads / 1e6, # convert to millions to match the axis label
               fill = .data[[FOI]])) +
    geom_bar(stat = "identity") +
    coord_flip() +
    scale_y_continuous(name = "million reads") +
    scale_fill_cb_friendly() + xlab("") + 
    ggtitle("Total reads") 

#get min percent mapped reads for reference
min_pct_mapped <- round(min(metrics$mapped_reads/metrics$total_reads)*100,1)
max_pct_mapped <- round(max(metrics$mapped_reads/metrics$total_reads)*100,1)

Mapping rate

The genomic mapping rate represents the percentage of reads mapping to the reference genome. We want to see consistent mapping rates between samples and over 70% mapping. These samples have mapping rates between r min_pct_mapped% and r max_pct_mapped%.

metrics$mapped_reads_pct <- round(metrics$mapped_reads/metrics$total_reads*100,1)
metrics %>%
    ggplot(aes(x = factor(sample, level = order), 
               y = mapped_reads_pct, 
               color = .data[[FOI]])) +
        geom_point() +
    coord_flip() +
    scale_color_cb_friendly() +
    ylim(0, 100) +
    ggtitle("Mapping rate") +
  geom_hline(yintercept=70, color = cb_friendly_cols('blue'))

Number of genes detected

The number of genes represented in every sample is expected to be consistent and over 20K (blue line).

se <- readRDS(se_object)
genes_detected <- colSums(assays(se)[["counts"]] > 0) %>% enframe()
sample_names <- metrics[,c("sample"), drop=F]
genes_detected <- left_join(genes_detected, sample_names, by = c("name" = "sample"))
genes_detected <- genes_detected %>% group_by(name)
genes_detected <- summarise(genes_detected, 
                             n_genes = max(value))

#### Code to fix stupid names, only for this analysis. Remove for git###
genes_detected$name <- gsub("X4","4", genes_detected$name)

                      
metrics <- metrics %>%
  left_join(genes_detected, by = c("sample" = "name"))

ggplot(metrics,aes(x = factor(sample, level = order),
           y = n_genes, fill = .data[[FOI]])) +
  geom_bar(stat = "identity") +
  coord_flip() +
  scale_fill_cb_friendly() +
  ggtitle("Number of genes") +
  ylab("Number of genes") +
  xlab("") +
  geom_hline(yintercept=20000, color = cb_friendly_cols('blue'))

Gene detection saturation

This plot shows how complex the samples are. We expect samples with more reads to detect more genes.

metrics %>% 
  ggplot(aes(x = total_reads, 
             y = n_genes,
             color = .data[[FOI]])) +
  geom_point()+
  scale_x_log10() +
  scale_color_cb_friendly() +
  ggtitle("Gene saturation") +
  ylab("Number of genes")

Exonic mapping rate

Here we are looking for consistency, and exonic mapping rates around 70% or 75% (blue and brown lines, respectively).

metrics %>%
  ggplot(aes(x = factor(sample, level = order),
             y = exonic_rate * 100, 
             color = .data[[FOI]])) +
  geom_point() +
  ylab("Exonic rate %") + 
  ggtitle("Exonic mapping rate") + 
  scale_color_cb_friendly() +
  coord_flip()  +
  xlab("") +
  ylim(c(0,100)) +
  geom_hline(yintercept=70, color = cb_friendly_cols('blue')) +
  geom_hline(yintercept=75, color = cb_friendly_cols('brown'))

Intronic mapping rate

Here, we expect a low intronic mapping rate (≤ 15-20%; brown and blue lines, respectively).

metrics %>%
    ggplot(aes(x = factor(sample, level = order),
               y = intronic_rate * 100, 
               color = .data[[FOI]])) +
  geom_point() +
  ylab("Intronic rate %") +
  ggtitle("Intronic mapping rate") + 
  scale_color_cb_friendly() +
  coord_flip()  +
  xlab("") +
  ylim(c(0,100)) +
  geom_hline(yintercept=20, color = cb_friendly_cols('blue')) +
  geom_hline(yintercept=15, color = cb_friendly_cols('brown'))

Intergenic mapping rate

Here, we expect a low intergenic mapping rate, which is true for all samples.

metrics %>%
    ggplot(aes(x = factor(sample, level = order),
               y = intergenic_rate * 100, 
               color = .data[[FOI]])) +
        geom_point() +
    ylab("Intergenic rate %") +
    ggtitle("Intergenic mapping rate") + 
    coord_flip()  + xlab("") +
    scale_color_cb_friendly() +
    ylim(c(0, 100))

rRNA mapping rate

Samples should have a ribosomal RNA (rRNA) "contamination" rate below 10%.

# for some bad samples it could be > 50%
rrna_ylim <- max(round(metrics$r_rna_rate*100, 2)) + 10
metrics %>%
    ggplot(aes(x = factor(sample, level = order),
               y = r_rna_rate * 100, 
               color = .data[[FOI]])) +
        geom_point() +
    ylab("rRNA rate, %")+
    ylim(0, rrna_ylim) + 
    ggtitle("rRNA mapping rate") +
    coord_flip() +
    scale_color_cb_friendly()

5'->3' bias

There should be little bias, i.e. the values should be close to 1, or at least consistent among samples.

metrics %>%
  ggplot(aes(x = factor(sample, level = order),
             y = x5_3_bias, 
             color = .data[[FOI]])) +
  geom_point() +
  ggtitle("5'-3' bias") + 
  coord_flip() +
  ylim(c(0.5,1.5)) + xlab("") + ylab("5'-3' bias") +
  scale_color_cb_friendly()+
  geom_hline(yintercept=1, color = cb_friendly_cols('blue'))

Counts per gene - all genes

We expect consistency in the box plots between samples, i.e. the distribution of counts across genes should be similar.

metrics_small <- metrics %>% dplyr::select(sample, .data[[FOI]])
metrics_small <- left_join(sample_names, metrics_small)

counts <- 
  assays(se)[["counts"]] %>% 
  as_tibble() %>%
  filter(rowSums(.)!=0) %>% 
  gather(name, counts) 

### Just for this analysis, not to be permanently added#####
counts$name <- gsub("X4","4", counts$name)


counts <- left_join(counts, metrics_small, by = c("name" = "sample"))

ggplot(counts, aes(factor(name, level = order),
                   log2(counts+1),
                   fill = .data[[FOI]])) +
  geom_boxplot() + 
  scale_fill_cb_friendly() +
  coord_flip() + xlab("") +
  ggtitle("Counts per gene, all non-zero genes") +
  scale_color_cb_friendly()

Sample similarity analysis

In this section, we look at how well the different groups in the dataset cluster with each other. Samples from the same group should ideally cluster together. We use Principal Component Analysis (PCA).

Principal component analysis (PCA) {.tabset}

Principal Component Analysis (PCA) is a statistical technique used to simplify high-dimensional data by identifying patterns and reducing the number of variables. In the context of gene expression, PCA helps analyze large datasets containing information about the expression levels of thousands of genes across different samples (e.g., tissues, cells).

raw_counts <- assays(se)[["counts"]] %>% round() %>%
    as_tibble() %>%
    filter(rowSums(.)!=0) %>%
    as.matrix()

### Just for this analysis, not to be permanently added#####
colnames(raw_counts) <- gsub("X4","4", colnames(raw_counts))



vst <- vst(raw_counts)

#fix samples names
coldat_for_pca <- as.data.frame(metrics)
rownames(coldat_for_pca) <- coldat_for_pca$sample
coldat_for_pca <- coldat_for_pca[colnames(raw_counts),]
pca1 <- degPCA(vst, coldat_for_pca,
              condition = FOI, data = T)[["plot"]]
pca2 <- degPCA(vst, coldat_for_pca,
              condition = FOI, data = T, pc1="PC3", pc2="PC4")[["plot"]]

pca1 + scale_color_cb_friendly()
pca2 + scale_color_cb_friendly()
## Remove non-useful columns output by nf-core
coldat_2 <- data.frame(coldat_for_pca[,!(colnames(coldat_for_pca) %in% c("fastq_1", "fastq_2", "salmon_library_types", "salmon_compatible_fragment_ratio", "samtools_reads_mapped_percent", "samtools_reads_properly_paired_percent", "samtools_mapped_passed_pct", "strandedness"))])

# Remove missing data
coldat_2 <- na.omit(coldat_2)
degCovariates(vst, metadata = coldat_2)
## Hierarchical clustering

vst_cor <- cor(vst)

annotation_cols <- cb_friendly_pal('grey')(length(unique(coldat_for_pca[[FOI]])))
names(annotation_cols) <- unique(coldat_for_pca[[FOI]])

## Note am still unable to get the annotation colors with the CB palette to work##
pheatmap(vst_cor, annotation_col = coldat_for_pca[,FOI, drop=F], show_rownames = T, show_colnames = T, color = cb_friendly_pal('heatmap')(15))
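## Hedged sketch (untested here): pheatmap expects `annotation_colors` to be a
## named list keyed by the annotation column name, each element a color vector
## named by the levels, so something like this may get the CB palette applied:
# annotation_colors <- setNames(list(annotation_cols), FOI)
# pheatmap(vst_cor, annotation_col = coldat_for_pca[, FOI, drop = F],
#          annotation_colors = annotation_colors,
#          show_rownames = T, show_colnames = T,
#          color = cb_friendly_pal('heatmap')(15))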

R session

List and versions of the tools used to generate the QC report.

sessionInfo()

QC metrics: make the default point/label size larger

Accessibility: the colored dots in the QC metrics plots are too small (for me, anyway) to easily see what color they are, especially over Zoom meetings. Individuals can of course change this themselves, but a larger default size might improve readability. Personally, I use barplots, whose color is hard to miss.
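For example, in the template's mapping-rate plot the change would be a single argument (a sketch reusing objects already defined in the QC report):

ggplot(metrics, aes(x = factor(sample, level = order), y = mapped_reads_pct,
                    color = .data[[FOI]])) +
  geom_point(size = 4)   # vs the ggplot2 default of ~1.5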

cutandrun report elements

Present in nf-core multiqc report:

  • fastqc metrics, including GC content
  • mapping percentages (bowtie2)
  • reads passing filters (cutadapt)
  • trimmed sequence lengths (cutadapt)
  • unique/multimappers (bowtie2)
  • alignment metrics and stats (samtools)
  • duplication stats (picard markDuplicates)
  • all-sample correlation heatmap (deeptools)
  • PCA (deeptools)
  • fingerprint plot (deeptools)
  • peak counts, individual and consensus (macs2/seacr)

Present in Upen's cutandrun QC report:

  • total reads
  • mapping rate
  • mapped reads
  • PBC1
  • PBC2
  • non-redundant fraction
  • GC percentage
  • peak counts
  • fraction of reads in peaks
  • all-sample correlation heatmap
  • PCA plots 1) faceted by antibody 2) all samples 3) zoomed in

Desired in bcbioR template:

  • total reads
  • mapping rate
  • mapped reads
  • peak counts
  • ?

Options for batch effect in data

RUVseq vs. SVA/ComBat-seq vs. other: if we are going to have a standardized option for finding and removing sources of unknown variability, we may want to first determine which of these methods is best.
○ Along those lines, there are QC steps that can/should be done, and should be included, for determining whether an unknown source of variation corresponds to known sources of variation in the data; this is important for avoiding overfitting the model if confounding variables (known or not) are included.
§ For example, correlating SVs (or the RUVSeq equivalent) with principal components and aspects of the metadata, as in the sketch below.
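A minimal sketch of that kind of check using svaseq (assumes a DESeqDataSet dds that has already been run through DESeq(), a condition factor of interest, and a hypothetical batch column in colData):

library(sva)
library(DESeq2)

coldat <- as.data.frame(colData(dds))
dat  <- counts(dds, normalized = TRUE)
dat  <- dat[rowMeans(dat) > 1, ]
mod  <- model.matrix(~ condition, coldat)
mod0 <- model.matrix(~ 1, coldat)
svseq <- svaseq(dat, mod, mod0, n.sv = 2)

# does SV1 track a known covariate (here, the hypothetical batch column)?
summary(lm(svseq$sv[, 1] ~ coldat$batch))

# does SV1 track the leading principal components of the VST counts?
pca <- prcomp(t(assay(vst(dds))))
cor(svseq$sv[, 1], pca$x[, 1:3])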

metrics excluded from nfcore

Present in the bcbio metrics df but currently absent from the nf-core metrics df because none of them are used for plots in QC.rmd; this affects the degCovariates plot in DE.rmd:

  • duplicates_pct
  • duplication_rate_of_mapped
  • r_rna (not r_rna_rate, that is in both already)
  • percent gc
  • mapped paired reads (not mapped reads, that is in both already)
  • average insert size

PCA

I have something that might be useful to include as an option for PCA: I always look at PCA labeled by all metadata and metrics before narrowing down/finalizing what I include in the report for the client. It's not always informative, but sometimes it surfaces an unexpected confounding variable in the data that's worth reporting back and/or considering in the analysis going forward. I wrote a function to automatically generate plots for a user-provided list of metadata/metrics over any range of principal components, provided that the metadata and metrics are joined together. I can provide this function if we think it would be useful to include.
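This is not that function, but a hypothetical stand-in to illustrate the idea (assumes a variance-stabilized matrix vst_mat and a data.frame coldata of joined metadata/metrics with one row per sample, in the same order as the columns of vst_mat):

library(ggplot2)

# one PC-pair scatter plot colored by one covariate
pca_covariate_plot <- function(df, xvar, yvar, covar, pct) {
  ggplot(df, aes(.data[[xvar]], .data[[yvar]], color = .data[[covar]])) +
    geom_point(size = 3) +
    labs(x = paste0(xvar, " (", pct[[xvar]], "%)"),
         y = paste0(yvar, " (", pct[[yvar]], "%)"),
         title = covar)
}

plot_pca_by_covariate <- function(vst_mat, coldata,
                                  covariates = colnames(coldata), pcs = 1:4) {
  pca <- prcomp(t(vst_mat))
  pct <- setNames(round(100 * pca$sdev^2 / sum(pca$sdev^2), 1), colnames(pca$x))
  df  <- cbind(as.data.frame(pca$x[, pcs, drop = FALSE]), coldata)
  plots <- list()
  for (covar in covariates) {
    for (p in combn(pcs, 2, simplify = FALSE)) {
      xvar <- paste0("PC", p[1]); yvar <- paste0("PC", p[2])
      plots[[paste(covar, xvar, yvar, sep = "_")]] <-
        pca_covariate_plot(df, xvar, yvar, covar, pct)
    }
  }
  plots
}

# e.g.: pca_plots <- plot_pca_by_covariate(vst, coldat_for_pca,
#                                          covariates = c("tissue_type", "total_reads"))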

DEG rdata annotation assumes human reference

DEG template, load_counts_data chunk to create rdata object from AnnotationDbi

rdata = AnnotationDbi::select(org.Hs.eg.db, rownames(counts), 'SYMBOL', 'ENSEMBL') %>% dplyr::select(gene_id = ENSEMBL, gene_name = SYMBOL)
Assumes human reference

The reference used should be based on the genome set in the parameters at the top of the file (see the sketch after the mouse example below).

If using mouse reference, need the following:

  1. In libraries:
    # BiocManager::install("org.Mm.eg.db")
    library(org.Mm.eg.db)

  2. In load_counts_data chunk to create rdata object
    rdata = AnnotationDbi::select(org.Mm.eg.db, rownames(counts), 'SYMBOL', 'ENSEMBL') %>% dplyr::select(gene_id = ENSEMBL, gene_name = SYMBOL)
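A hedged sketch of basing this on a genome parameter (the genome variable and the values it can take are assumptions about what the params file could expose):

org_db <- switch(genome,
                 "hg38"   = ,
                 "GRCh38" = org.Hs.eg.db::org.Hs.eg.db,
                 "mm10"   = ,
                 "GRCm38" = ,
                 "GRCm39" = org.Mm.eg.db::org.Mm.eg.db,
                 stop("No OrgDb configured for genome: ", genome))

rdata = AnnotationDbi::select(org_db, rownames(counts), 'SYMBOL', 'ENSEMBL') %>%
  dplyr::select(gene_id = ENSEMBL, gene_name = SYMBOL)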

Viridis for Heatmap

Viridis would be both color-blind friendly and more aesthetically pleasing/easier to parse for people without color blindness.
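For example, swapping the color argument in the existing hierarchical-clustering heatmap call (a sketch, assuming the viridis package is installed):

pheatmap(vst_cor, annotation_col = coldat_for_pca[, FOI, drop = F],
         show_rownames = T, show_colnames = T,
         color = viridis::viridis(15))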

DE_template_example.Rmd - subset discussion

DE_template_example.Rmd
• Consider changing the template to not subset the data by default:
○ DESeq2 best practice is to avoid subsetting the data: "Typically, we recommend users to run samples from all groups together, and then use the contrast argument of the results function to extract comparisons of interest after fitting the model using DESeq."
○ Source: https://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#if-i-have-multiple-groups-should-i-run-all-together-or-split-into-pairs-of-groups
§ One thing to note, however, is that apeglm shrinkage is incompatible with contrasts, so depending on the individual study and which comparisons need to be made, the shrinkage method may need to be swapped for ashr, which is compatible with contrasts (see the sketch below).
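A short hedged example of that swap (factor and level names are assumptions): fit all groups together, then pull each pairwise comparison out with a contrast and shrink with ashr.

dds <- DESeq(dds)   # all groups together, no subsetting
res_b_vs_a <- lfcShrink(dds, contrast = c("condition", "groupB", "groupA"), type = "ashr")
res_c_vs_a <- lfcShrink(dds, contrast = c("condition", "groupC", "groupA"), type = "ashr")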

Enrichment analysis flexibility

Over-representation analysis options should ideally include splitting the analysis by up-regulated and down-regulated significantly differentially expressed genes; it's usually not as informative to use all significant genes together (a sketch of the split follows below).
• More options for over-representation analysis (clusterProfiler, enrichGO) might be nice
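A hedged sketch of the up/down split with clusterProfiler::enrichGO (the results data frame res_df and its column names are assumptions, and the human OrgDb would be swapped per the genome discussion above):

library(clusterProfiler)
library(org.Hs.eg.db)

sig  <- subset(res_df, padj < 0.05)
up   <- sig$gene_id[sig$log2FoldChange > 0]
down <- sig$gene_id[sig$log2FoldChange < 0]

ora_up   <- enrichGO(up,   OrgDb = org.Hs.eg.db, keyType = "ENSEMBL", ont = "BP",
                     universe = res_df$gene_id)
ora_down <- enrichGO(down, OrgDb = org.Hs.eg.db, keyType = "ENSEMBL", ont = "BP",
                     universe = res_df$gene_id)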

Adapt DE report to nf-core

Add variables to params_de.R for the nf-core files.

Create an R function file that would be loaded by DEG.Rmd, with a function that creates the count and metrics variables used downstream by the DEG.Rmd code (rough sketch below).
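A rough sketch of such a function (file names, column names, and the params$ entries are assumptions based on typical nf-core/rnaseq outputs and would be defined in params_de.R):

load_nfcore_de_inputs <- function(counts_fn, multiqc_fn, metadata_fn) {
  # gene counts, e.g. star_salmon/salmon.merged.gene_counts.tsv
  counts <- readr::read_tsv(counts_fn) %>%
    dplyr::select(-dplyr::any_of("gene_name")) %>%
    tibble::column_to_rownames("gene_id") %>%
    round()
  # QC metrics, e.g. multiqc_data/multiqc_general_stats.txt
  metrics <- readr::read_tsv(multiqc_fn) %>% janitor::clean_names()
  # nf-core input samplesheet, de-duplicated across lanes
  coldata <- readr::read_csv(metadata_fn) %>%
    dplyr::distinct(sample, .keep_all = TRUE)
  list(counts = counts, metrics = metrics, coldata = coldata)
}

# in DEG.Rmd:
# inputs  <- load_nfcore_de_inputs(params$counts_fn, params$multiqc_fn, params$metadata_fn)
# counts  <- inputs$counts
# metrics <- inputs$metrics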
