Giter Site home page Giter Site logo

mulea-old's Introduction

MulEA: an R package for Multi-Enrichment Analysis

MulEA

Installing the MulEA package using devtools

library(devtools)
install_github("https://github.com/koralgooll/MulEA.git")

An example of how to use the MulEA package

The data set to analyse

  • Analysed microarray data from NCBI GEO database: GSE55662
  • It was published by Méhi et al. (2014) in Molecular Biology and Evolution.
  • The authors studied the evolution of antibiotic resistance in Escerichia coli bacteria.
  • They treated the bacteria with ciprofloxacin antibiotic and measured the gene expression changes.
  • During the differential expression analysis using the GEO2R, the following comparison were made:
    • Non-treated wild-type control samples (2 biological replicates) vs.
    • Wild-type samples treated with ciprofloxacin (2 biological replicates)

Reading the table containing the results of the differential expression analysis:

library(MulEA)
library(tidyverse)

Geo2R_result_tab <- read_tsv("GSE55662.table_wt_non_vs_cipro.tsv")

Let's see the first 3 rows of the Geo2R_result_tab data.frame:

ID adj.P.Val P.Value t B logFC Gene.symbol Gene.title
1765336_s_at 0.0186 2.4e-06 21.5 4.95769 3.70 gnsB Qin prophage; multicopy suppressor of secG(Cs) and fabA6(Ts)
1760422_s_at 0.0186 3.8e-06 19.6 4.68510 3.14 NA NA
1764904_s_at 0.0186 5.7e-06 18.2 4.43751 2.54 sulA///sulA///sulA///ECs1042 SOS cell division inhibitor///SOS cell division inhibitor///SOS cell division inhibitor///SOS cell division inhibitor

We need to format the data.frame before using it for enrichment analysis. This step is specific to the type of microarray has been used. Comment: positive logFC-s mean overexpression under ciprofloxacin treatment.

Geo2R_result_tab %<>% 
  # extracting the first gene symbol from the Gene.symbol column
  mutate(Gene.symbol = str_remove(string = Gene.symbol,
                                  pattern = "\\/.*")) %>% 
  # removing rows where Gene.symbol is NA
  filter(!is.na(Gene.symbol)) %>% 
  # ordering by logFC
  arrange(desc(logFC))

Let's see what did change in the first 3 rows of the Geo2R_result_tab data.frame:

ID adj.P.Val P.Value t B logFC Gene.symbol Gene.title
1765336_s_at 0.0186 2.40e-06 21.5 4.95769 3.70 gnsB Qin prophage; multicopy suppressor of secG(Cs) and fabA6(Ts)
1764904_s_at 0.0186 5.70e-06 18.2 4.43751 2.54 sulA SOS cell division inhibitor///SOS cell division inhibitor///SOS cell division inhibitor///SOS cell division inhibitor
1761763_s_at 0.0186 1.54e-05 15.0 3.73568 2.16 recN recombination and repair protein///recombination and repair protein///recombination and repair protein///recombination and repair protein

The database for the enrichment analysis

  • We were curious about which transcription factors regulated the expression of the significantly overexpressed genes.
  • Therefore, we used the MulEA package to perform multi-enrichment analysis on the Regulon database.
  • The GMT file containing genes symbols that are regulated by transcription factors was downloaded from the Github page of MulEA.

Reading the GMT file containing the lists of gene symbols each transcription factor (indicated with gene symbols as well) regulates:

Regulon_GMT <- read_gmt("RegulonDB_Escherichia_coli_genesymbol.gmt")

How many transcription factors are in the Regulon GMT file?

nrow(Regulon_GMT)

211

Let's see the first 3 rows of the Regulon_GMT data.frame:

ontologyId ontologyName listOfValues
AccB "AccB" accC, accB
AcrR "AcrR" marB, ma....
Ada "Ada" alkB, ad....

We have to mention that the in the Regulon GMT files both the ontologyId ans the ontologyName columns contain the gene symbols of the transcription factors. In the case of some other GMT files, i.e. the GO GMT files, the ontologyId column contains the GO IDs and the ontologyName column contains the GO terms.

The listOfValues lists of the gene symbols that are regulated by the transcription factor indicated in the ontologyId column. To see all such genes for example in the case of the transcription factor AcrR, we can use the following code:

Regulon_GMT$listOfValues[[which(Regulon_GMT$ontologyId == "AcrR")]]

marB marR marA acrB micF flhD acrR flhC acrA soxS soxR

Filtering the ontology entries

When interpreting the results of enrichment analyses, one may encounter the problem of the results being dominated by either overly specific or overly broad ontology entries being enriched. In MulEA, users can tailor the size of the ontology entries to their specific requirements, ensuring that the results match the expected scope.

Let's see the distribution of number of elements (gene symbols) in the listOfValues column to decide if we need to exclude too specific or too broad ontology entries:

Nr_of_elements_in_ontology <- Regulon_GMT$listOfValues %>% 
  map_dbl(length)

ggplot(mapping = aes(Nr_of_elements_in_ontology)) + 
  geom_bar() +
  theme_minimal()

We now see that there are some ontology entries containing more than 200 gene symbols. These transcription factors regulate a lot of genes, therefore not specific enough. We will exclude these from the enrichment analysis.

We also see that there are some ontology entries with only a small number of elements. Let's zoom in to this part of the distribution:

ggplot(mapping = aes(Nr_of_elements_in_ontology)) + 
  geom_bar() +
  xlim(0, 15) +
  theme_minimal()

Let's exclude the ontology entries containing less than 3 or more than 400 gene symbols and check the remaining number of transcription factors:

Regulon_GMT_filtered <- filter_ontology(gmt = Regulon_GMT,
                                        min_nr_of_elements = 3,
                                        max_nr_of_elements = 400)

How many transcription factors are in the filtered GMT object?

nrow(Regulon_GMT_filtered)

154

We even can save the filtered GMT object to a file:

write_gmt(gmt = Regulon_GMT_filtered, 
          file = "RegulonDB_Escherichia_coli_genesymbol_filtered.gmt")

Overrepresentation analysis (ORA)

Preparing input data for the ORA

Creating the "test" gene set

A vector containing the gene symbols of significantly overexpressed ($adjusted\ p-value &lt; 0.05$) genes with greater than 2 fold-change ($logFC &gt; 1$).

E.coli_sign_genes <- Geo2R_result_tab %>% 
  # filtering for adjusted p-value < 0.05 and logFC > 1
  filter(adj.P.Val < 0.05
         & logFC > 1) %>% 
  # selecting the Gene.symbol column
  select(Gene.symbol) %>% 
  # convert tibble to vector
  pull() %>% 
  # removing duplicates
  unique() %>% 
  # sorting
  sort()

Let's see the first 10 elements of the E.coli_sign_genes vector:

E.coli_sign_genes %>% 
  head(10)

accD ackA acnB agp appY aroC asnC bglA bioD btuB

Let's see the number of genes in the E.coli_sign_genes vector:

E.coli_sign_genes %>% 
  length()

241

Creating the "background" gene set

A vector containing the gene symbols of all genes were included in the differential expression analysis.

E.coli_background_genes <- Geo2R_result_tab %>% 
  # selecting the Gene.symbol column
  select(Gene.symbol) %>% 
  # convert tibble to vector
  pull() %>% 
  # removing duplicates
  unique() %>% 
  # sorting
  sort()

Let's see the number of genes in the E.coli_background_genes vector:

E.coli_background_genes %>% 
  length()

7381

Performing the ORA

Let's correct for multiple testing using the empirical FDR method with 10,000 permutations:

# creating the ORA model using the GMT variable
ora_model <- ora(gmt = Regulon_GMT_filtered, 
                 # the test gene set variable
                 element_names = E.coli_sign_genes, 
                 # the background gene set variable
                 background_element_names = E.coli_background_genes, 
                 # the p-value adjustment method
                 p_value_adjustment_method = "eFDR", 
                 # the number of permutations
                 number_of_permutations = 10000,
                 # the number of processor threads to use
                 number_of_cpu_threads = 4) 

# running the ORA
ora_results <- run_test(ora_model)

The results of the ORA

The ora_results is a data.frame containing the enriched transcription factors and the corresponding $p$ and $empirical\ FDR$ values.

Let's see the number of "enriched" transcription factors:

ora_results %>%
  filter(eFDR < 0.05) %>% 
  nrow()

10

Let's see the significant part of the ora_results data.frame:

ontology_id ontology_name nr_common_with_tested_elements nr_common_with_backgound_elements p_value eFDR
FNR "FNR" 26 259 0.0000003 0.0000000
LexA "LexA" 14 53 0.0000000 0.0000000
SoxS "SoxS" 7 37 0.0001615 0.0028000
DnaA "DnaA" 4 13 0.0006281 0.0051333
Rob "Rob" 5 21 0.0004717 0.0052400
FadR "FadR" 5 20 0.0003692 0.0054500
NsrR "NsrR" 8 64 0.0010478 0.0073286
ArcA "ArcA" 12 148 0.0032001 0.0204250
IHF "IHF" 14 205 0.0070758 0.0454700
MarA "MarA" 5 37 0.0066068 0.0480333

Visualizing the ORA results

Initializing the visualization:

ora_reshaped_results <- reshape_results(model = ora_model, 
                                        model_results = ora_results, 
                                        # choosing which column to use for the indication of significance
                                        p_value_type_colname = "eFDR")

Barplot -> Lollipop plot

The bars and their colouring show the significance levels of the enriched ontologies (transcription factors).

plot_barplot(reshaped_results = ora_reshaped_results,
             # the column containing the names we wish to plot
             ontology_id_colname = "ontology_id",
             # upper threshold for the value indicating the significance
             p_value_max_threshold = 0.05,
             # column that indicates the significance values
             p_value_type_colname = "eFDR")

Network plot

The function creates a network plot of the enriched ontologies (transcription factors). The nodes are the ontology IDs (Regulon IDs) coloured according to their significance level. Two nodes are connected if they have at least one shared element (gene which expression level was influenced by both of the transcription factor). The edges are weighted by the number of common elements between the nodes.

plot_graph(reshaped_results = ora_reshaped_results,
           # the column containing the names we wish to plot
           ontology_id_colname = "ontology_id",
           # upper threshold for the value indicating the significance
           p_value_max_threshold = 0.05,
           # column that indicates the significance values
           p_value_type_colname = "eFDR")

Heatmap

The actual elements (genes) of the enriched ontologies (transcription factors) connected with. The rows are the ontology IDs (Regulon IDs) coloured according to their significance level. The columns are the elements (genes) of the ontologies.

plot_heatmap(reshaped_results = ora_reshaped_results,
             # the column containing the names we wish to plot
             ontology_id_colname = "ontology_id",
             # column that indicates the significance values
             p_value_type_colname = "eFDR")

Gene set enrichment analysis (GSEA)

Preparing input data for the GSEA

A data.frame containing all the genes which expression were measured in the differential expression analysis and their log fold change values ($logFC$). (Or two vectors containing the gene symbols and the corresponding $logFC$ values.)

# if there are duplicated Gene.symbols keep the first one only
Geo2R_result_tab_filtered <- Geo2R_result_tab %>% 
  # grouping by Gene.symbol to be able to filter
  group_by(Gene.symbol) %>%
  # keeping the first row for each Gene.symbol from rows with the same Gene.symbol
  filter(row_number()==1) %>% 
  ungroup() %>% 
  # arranging by logFC in descending order
  arrange(desc(logFC)) %>%
  select(Gene.symbol, logFC)

Let's check the number of gene symbols in the E.coli_ordered_genes vector:

Geo2R_result_tab_filtered %>% 
  nrow()

7381

Performing the GSEA

Let's correct for multiple testing using the empirical FDR method with 10,000 permutations:

gsea_model <- gsea(gmt = Regulon_GMT_filtered,
                   element_names = Geo2R_result_tab_filtered$Gene.symbol,
                   element_scores = Geo2R_result_tab_filtered$logFC,
                   # consider elements having positive logFC values only
                   element_score_type = "pos",
                   number_of_permutations = 10000)

gsea_results <- run_test(gsea_model)

Visualizing the GSEA results

Initializing the visualization:

gsea_reshaped_results <- reshape_results(model = gsea_model, 
                                         model_results = gsea_results, 
                                         model_ontology_col_name = "ontologyId",
                                         ontology_id_colname = "ontologyId",
                                         # choosing which column to use for the indication of significance
                                         p_value_type_colname = "adjustedPValue")

Barplot -> Lollipop plot

The bars and their colouring show the significance levels of the enriched ontologies (transcription factors).

plot_barplot(reshaped_results = gsea_reshaped_results,
             # the column containing the names we wish to plot
             ontology_id_colname = "ontology_id",
             # upper threshold for the value indicating the significance
             p_value_max_threshold = 0.05,
             # column that indicates the significance values
             p_value_type_colname = "adjustedPValue")

Network plot

The function creates a network plot of the enriched ontologies (transcription factors). The nodes are the ontology IDs (Regulon IDs) coloured according to their significance level. Two nodes are connected if they have at least one shared element (gene which expression level was influenced by both of the transcription factor). The edges are weighted by the number of common elements between the nodes.

plot_graph(reshaped_results = gsea_reshaped_results,
           # the column containing the names we wish to plot
           ontology_id_colname = "ontology_id",
           # upper threshold for the value indicating the significance
           p_value_max_threshold = 0.05,
           # column that indicates the significance values
           p_value_type_colname = "adjustedPValue")

Heatmap

The actual elements (genes) of the enriched ontologies (transcription factors) connected with. The rows are the ontology IDs (Regulon IDs) coloured according to their significance level. The columns are the elements (genes) of the ontologies.

plot_heatmap(reshaped_results = gsea_reshaped_results,
             # the column containing the names we wish to plot
             ontology_id_colname = "ontology_id",
             # column that indicates the significance values
             p_value_type_colname = "adjustedPValue")

Session info

sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=pl_PL.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=pl_PL.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=pl_PL.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=pl_PL.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Warsaw
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] lubridate_1.9.2 forcats_1.0.0   stringr_1.5.0   dplyr_1.1.2    
 [5] purrr_1.0.1     readr_2.1.4     tidyr_1.3.0     tibble_3.2.1   
 [9] ggplot2_3.4.2   tidyverse_2.0.0 MulEA_0.99.10  

loaded via a namespace (and not attached):
 [1] fastmatch_1.1-3     gtable_0.3.3        xfun_0.39          
 [4] ggrepel_0.9.3       lattice_0.21-8      tzdb_0.4.0         
 [7] vctrs_0.6.3         tools_4.3.1         generics_0.1.3     
[10] parallel_4.3.1      fansi_1.0.4         pkgconfig_2.0.3    
[13] Matrix_1.6-1        data.table_1.14.8   lifecycle_1.0.3    
[16] compiler_4.3.1      farver_2.1.1        munsell_0.5.0      
[19] ggforce_0.4.1       fgsea_1.26.0        graphlayouts_1.0.0 
[22] codetools_0.2-19    htmltools_0.5.5     yaml_2.3.7         
[25] pillar_1.9.0        crayon_1.5.2        MASS_7.3-60        
[28] BiocParallel_1.34.2 viridis_0.6.3       tidyselect_1.2.0   
[31] digest_0.6.32       stringi_1.7.12      labeling_0.4.2     
[34] cowplot_1.1.1       polyclip_1.10-4     fastmap_1.1.1      
[37] grid_4.3.1          colorspace_2.1-0    cli_3.6.1          
[40] magrittr_2.0.3      ggraph_2.1.0        tidygraph_1.2.3    
[43] utf8_1.2.3          withr_2.5.0         scales_1.2.1       
[46] bit64_4.0.5         timechange_0.2.0    rmarkdown_2.25     
[49] igraph_1.5.0        bit_4.0.5           gridExtra_2.3      
[52] hms_1.1.3           evaluate_0.21       knitr_1.43         
[55] viridisLite_0.4.2   rlang_1.1.1         Rcpp_1.0.10        
[58] glue_1.6.2          tweenr_2.0.2        rstudioapi_0.14    
[61] vroom_1.6.3         jsonlite_1.8.7      R6_2.5.1           
[64] plyr_1.8.8         

mulea-old's People

Contributors

koralgooll avatar stitam avatar olbeimarton avatar wjurkowski avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.