igordot / msigdbr

MSigDB gene sets for multiple organisms in a tidy data format

Home Page: https://igordot.github.io/msigdbr

License: Other

R 100.00%

genomics msigdb gene-sets pathways gsea pathway-analysis enrichment-analysis

msigdbr's Issues

number of gene sets

Hi
I am using msigdbr 7.5.1 and wonder why the number of gene sets is smaller than the number listed on the website. For example, msigdbr_collections() reports 1615 Reactome pathways, but the website https://www.gsea-msigdb.org/gsea/msigdb/human/genesets.jsp?collection=CP:REACTOME lists 1635. Thanks!

msigdbr package, category C2, subcategory CP

Hello,
I'm currently running a GSEA using the msigdbr package.
I've noticed that subcategory CP of category C2 contains only 29 gene sets, as displayed by msigdbr_collections(), whereas this subcategory should include all of its dependent gene sets (KEGG, Reactome, WikiPathways, ...) and originally contains 2982 gene sets, as detailed on the original website: http://www.gsea-msigdb.org/gsea/msigdb/genesets.jsp?collection=CP

Any recommendations for running all of the gene sets that fall under the CP subcategory?

Thank you!
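
A note on one possible workaround: the 29 sets are presumably those labelled exactly "CP", while KEGG, Reactome and the others carry their own "CP:..." labels. The sketch below uses a toy table, not real msigdbr output, and assumes the subcategory column is called gs_subcat as in the printed tibble elsewhere on this page. The set names and the "CP:BIOCARTA" label are placeholders.

```r
# Toy stand-in for a gene set table; the values are illustrative only.
collections <- data.frame(
  gs_subcat = c("CGP", "CP", "CP:BIOCARTA", "CP:KEGG",
                "CP:REACTOME", "CP:WIKIPATHWAYS"),
  gs_name   = paste0("SET_", 1:6),
  stringsAsFactors = FALSE
)

# Keep the bare "CP" label plus every label that starts with "CP:".
is_cp  <- collections$gs_subcat == "CP" | startsWith(collections$gs_subcat, "CP:")
all_cp <- collections[is_cp, ]

print(all_cp$gs_subcat)
```

With real data, the same filter could probably be applied to the table returned for category C2 without a subcategory argument, but that depends on the package version in use.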

Archived genesets

Hello!

Is there any way msigdbr could be used to access archived gene sets, i.e. those belonging to the "ARCHIVED" collection, such as PENG_GLUTAMINE_DEPRIVATION_DN?

Best,
Henry

Save the 'entrez_gene' columns in character mode

First, thanks for this great package! It is especially helpful that it directly outputs three different gene ID types, which saves a lot of time when switching between them.

I have a small suggestion. In the output table, the columns related to "entrez_gene" are stored as integers. I would suggest changing them to characters, as other Bioconductor annotation packages do (e.g. org.Hs.eg.db).

gene_sets
# A tibble: 8,209 × 15
   gs_cat gs_su…¹ gs_name gene_…² entre…³ ensem…⁴ human…⁵ human…⁶ human…⁷ gs_id gs_pmid gs_ge…⁸
   <chr>  <chr>   <chr>   <chr>     <int> <chr>   <chr>     <int> <chr>   <chr> <chr>   <chr>  
 1 H      ""      HALLMA… ABCA1        19 ENSG00… ABCA1        19 ENSG00… M5905 267710… ""     
 2 H      ""      HALLMA… ABCB8     11194 ENSG00… ABCB8     11194 ENSG00… M5905 267710… ""     
 3 H      ""      HALLMA… ACAA2     10449 ENSG00… ACAA2     10449 ENSG00… M5905 267710… ""     
 4 H      ""      HALLMA… ACADL        33 ENSG00… ACADL        33 ENSG00… M5905 267710… ""     
 5 H      ""      HALLMA… ACADM        34 ENSG00… ACADM        34 ENSG00… M5905 267710… ""     
 6 H      ""      HALLMA… ACADS        35 ENSG00… ACADS        35 ENSG00… M5905 267710… ""     
 7 H      ""      HALLMA… ACLY         47 ENSG00… ACLY         47 ENSG00… M5905 267710… ""     
 8 H      ""      HALLMA… ACO2         50 ENSG00… ACO2         50 ENSG00… M5905 267710… ""     
 9 H      ""      HALLMA… ACOX1        51 ENSG00… ACOX1        51 ENSG00… M5905 267710… ""     
10 H      ""      HALLMA… ADCY6       112 ENSG00… ADCY6       112 ENSG00… M5905 267710… ""     
# … with 8,199 more rows, 3 more variables: gs_exact_source <chr>, gs_url <chr>,
#   gs_description <chr>, and abbreviated variable names ¹gs_subcat, ²gene_symbol,
#   ³entrez_gene, ⁴ensembl_gene, ⁵human_gene_symbol, ⁶human_entrez_gene, ⁷human_ensembl_gene,
#   ⁸gs_geoid
# ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

Imagine we want to convert Entrez IDs to RefSeq IDs, and we have a mapping vector (map) where Entrez IDs are the names and RefSeq IDs are the values. The natural way to convert is:

map[gene_sets$entrez_gene]

This causes a problem because gene_sets$entrez_gene is an integer vector, so its values are treated as positional indices into map rather than being matched against the names of map.

To do it correctly, we need to explicitly convert gene_sets$entrez_gene to character:

map[as.character(gene_sets$entrez_gene)]

The more severe consequence is that, if the maximum value in gene_sets$entrez_gene is smaller than the length of map, map[gene_sets$entrez_gene] produces no warning or error at all. It silently returns wrong results.
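
A minimal base R sketch of the pitfall described above. The Entrez IDs and the RefSeq-like values are made-up placeholders, chosen so that positional and name-based lookup disagree.

```r
# Named lookup vector: names are Entrez IDs, values are placeholder strings.
map <- c("3" = "refseq_for_3", "1" = "refseq_for_1", "2" = "refseq_for_2")

# Entrez IDs stored as integers, as in the msigdbr output shown above.
ids <- c(1L, 2L, 3L)

# Integer subscripts select by position, so this returns the 1st, 2nd and
# 3rd elements of map regardless of their names. No warning is raised.
by_position <- unname(map[ids])

# Character subscripts select by name, which is the intended lookup.
by_name <- unname(map[as.character(ids)])

print(by_position)
print(by_name)
```

Here by_position starts with "refseq_for_3" while by_name starts with "refseq_for_1", which shows how the integer form can return a valid-looking but wrong mapping.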

Some orthologs are missing

Hi,

I am trying to use msigdbr for a GSEA analysis of the gene set HSF1_01 in MSigDB.

In MSigDB this gene set contains the gene SHFM3, but that gene is missing from your list of orthologs for the same gene set.

I did a search for this gene - https://uswest.ensembl.org/Multi/Search/Results?q=SHFM3;site=ensembl

And found out that this gene has an alias/synonym - FBXW4 (as shown here - > https://uswest.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000107829;r=10:101610664-101695295 )

This particular alias (FBXW4) does have ORTHOLOG information for mus musculus (Fbxw4) as shown at - https://uswest.ensembl.org/Homo_sapiens/Gene/Compara_Ortholog?db=core;g=ENSG00000107829;r=10:101610664-101695295

There are many such cases, and I was wondering whether that is intentional or whether it could be fixed in a future release.

Much appreciated!

Ashu

getting error

Hello and thank you for your work,

I have this piece of code

library(msigdbr)

all_gene_sets <- msigdbr(species = "Mus musculus")
head(all_gene_sets)

but I am having the following error:

Error in parse(text = elt): <text>:1:5: simbolo inatteso
1: Use of
        ^
Traceback:

1. msigdbr(species = "Mus musculus")
2. orthologs(genes = genesets_subset$human_ensembl_gene, species = species) %>% 
 .     select(-any_of(c("human_symbol", "human_entrez"))) %>% rename(human_ensembl_gene = .data$human_ensembl, 
 .     gene_symbol = .data$symbol, entrez_gene = .data$entrez, ensembl_gene = .data$ensembl, 
 .     ortholog_sources = .data$support, num_ortholog_sources = .data$support_n)
3. rename(., human_ensembl_gene = .data$human_ensembl, gene_symbol = .data$symbol, 
 .     entrez_gene = .data$entrez, ensembl_gene = .data$ensembl, 
 .     ortholog_sources = .data$support, num_ortholog_sources = .data$support_n)
4. rename.data.frame(., human_ensembl_gene = .data$human_ensembl, 
 .     gene_symbol = .data$symbol, entrez_gene = .data$entrez, ensembl_gene = .data$ensembl, 
 .     ortholog_sources = .data$support, num_ortholog_sources = .data$support_n)
5. tidyselect::eval_rename(expr(c(...)), .data)
6. rename_impl(data, names(data), as_quosure(expr, env), strict = strict, 
 .     name_spec = name_spec, allow_predicates = allow_predicates, 
 .     error_call = error_call)
7. eval_select_impl(x, names, {
 .     {
 .         sel
 .     }
 . }, strict = strict, name_spec = name_spec, type = "rename", allow_predicates = allow_predicates, 
 .     error_call = error_call)
8. with_subscript_errors(out <- vars_select_eval(vars, expr, strict = strict, 
 .     data = x, name_spec = name_spec, uniquely_named = uniquely_named, 
 .     allow_rename = allow_rename, allow_empty = allow_empty, allow_predicates = allow_predicates, 
 .     type = type, error_call = error_call), type = type)
9. try_fetch(expr, vctrs_error_subscript = function(cnd) {
 .     cnd$subscript_action <- subscript_action(type)
 .     cnd$subscript_elt <- "column"
 .     cnd_signal(cnd)
 . })
10. withCallingHandlers(expr, vctrs_error_subscript = function(cnd) {
  .     {
  .         .__handler_frame__. <- TRUE
  .         .__setup_frame__. <- frame
  .     }
  .     out <- handlers[[1L]](cnd)
  .     if (!inherits(out, "rlang_zap")) 
  .         throw(out)
  . })
11. vars_select_eval(vars, expr, strict = strict, data = x, name_spec = name_spec, 
  .     uniquely_named = uniquely_named, allow_rename = allow_rename, 
  .     allow_empty = allow_empty, allow_predicates = allow_predicates, 
  .     type = type, error_call = error_call)
12. walk_data_tree(expr, data_mask, context_mask)
13. eval_c(expr, data_mask, context_mask)
14. reduce_sels(node, data_mask, context_mask, init = init)
15. walk_data_tree(new, data_mask, context_mask)
16. expr_kind(expr, context_mask, error_call)
17. call_kind(expr, context_mask, error_call)
18. lifecycle::deprecate_soft("1.2.0", what, details = cli::format_inline("Please use {.code {str}} instead of `.data${var}`"), 
  .     user_env = env)
19. signal_stage("deprecated", what)
20. spec(what, env = env)
21. spec_what(spec, "spec", signaller)
22. parse_expr(what)
23. parse_exprs(x)
24. chr_parse_exprs(x)
25. map(x, function(elt) as.list(parse(text = elt)))
26. lapply(.x, .f, ...)
27. FUN(X[[i]], ...)
28. as.list(parse(text = elt))
29. parse(text = elt)

Could you provide help to solve this issue?
Thank you in advance

SCSig collection

Dear @igordot

Thanks for the nice package!

MSigDB recently added the SCSig collection (Signatures of Single Cell Identities):
http://software.broadinstitute.org/gsea/msigdb/supplementary_genesets.jsp#SCSig
I would appreciate it if you could extend this package to cover the SCSig gene sets.

Regards,

Koki

Ensembl Gene IDs

Are Ensembl gene sets supported?

I have just started using msigdbr and I cannot find any Ensembl gene IDs in the gene sets I have seen so far.

Thanks!

Adding the "EXACT_SOURCE" column to the MsigDB C5 entries

Thanks for the very useful package,
would it be possible to add the EXACT_SOURCE attribute to GENESET record attributes for msigdb C5 gene sets? It would make it much easier to convert msigdb accession numbers into GO IDs. Thanks!

Accessing Mouse MSigDB Collections

Is there any possibility for the package to support collections that don't correspond to the human collections H, C1, ..., C8? For example accessing MH, M1, ..., M8 listed at the link below?

https://www.gsea-msigdb.org/gsea/msigdb/mouse/collections.jsp

Skip 7.3 CRAN release and go straight to 7.4(?)

It looks like the MSigDB v7.4 signature collections were released before an msigdbr version for the v7.3 signatures was pushed to CRAN. Maybe skip a v7.3 msigdbr release and go straight to the v7.4 signatures for the next CRAN release?

Thanks!

Run KEGG in Seurat object

@igordot @smped @vreuter @actions-user

Hello msigdbr team,

I am running GSEA on 10X spatial and scRNA-seq data and I would like to use the KEGG gene sets.
Which function/category should I use?
For Hallmark, I run m_df <- msigdbr(species = "Homo sapiens", category = "H")

but category = "KEGG" does not work. I would greatly appreciate your advice.

Thank you.

enricher result is different from msigDB web "investigate Gene Sets"

Hi,

Many thanks for the msigdbr package.
Can I ask a question about the result of enricher please?

msigdbr_t2g = msigdbr_df %>% dplyr::select(gs_name, gene_symbol) %>% as.data.frame()
enricher(gene = gene_symbols_vector, TERM2GENE = msigdbr_t2g, ...)

I am using the code above, but the enriched MSigDB signatures differ from the results of "Investigate Gene Sets" on the MSigDB website. I thought the test was based on the number of genes shared between the user's gene list and the gene set. However, the overlap count reported by enricher seems smaller than the real overlap count (i.e. the number I get when I use intersect on my genes and the MSigDB gene set). Did I misunderstand how enricher works here? And if possible, how can I get the same results as the MSigDB website?

Thanks in advance!

Best,
Wei
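
For reference, overlap-based enrichment is commonly scored with a hypergeometric test, and its result depends on the background (universe) as well as the overlap count. The base R sketch below shows that general calculation on toy identifiers. It is not taken from enricher or from the MSigDB website, either of which may filter genes or choose the background differently.

```r
# Toy identifiers; none of these are real gene symbols.
universe <- paste0("G", 1:20)                 # background genes
gene_set <- paste0("G", 1:5)                  # one gene set
query    <- c("G1", "G2", "G3", "G10", "X1")  # user's genes, one outside the universe

# Genes outside the background cannot be counted, so drop them first.
query_in_universe <- intersect(query, universe)
overlap <- length(intersect(query_in_universe, gene_set))

# P(overlap >= observed) when drawing length(query_in_universe) genes
# at random from the universe.
p_value <- phyper(
  q = overlap - 1,
  m = length(gene_set),
  n = length(universe) - length(gene_set),
  k = length(query_in_universe),
  lower.tail = FALSE
)

print(overlap)
print(p_value)
```

Dropping "X1" before counting shows one way the usable gene list can shrink relative to a plain intersect on the raw input, although in this toy case the overlap itself is unaffected.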

Retrieve all C2 canonical pathways using option subcategory = "CP"?

Dear Igordot,

Thanks for this wonderful tool! I understand it can be used to retrieve subcategory pathways by setting subcategory = "CP:KEGG". But I was wondering if I can extract all canonical pathways as follows:

library(msigdbr)
m_df = msigdbr(species = "Homo sapiens", category = 'C2', subcategory = 'CP')
length(unique(m_df$gs_name))
[1] 29

Looking forward to your comments!

Best,
Lei

`unused argument (.data$species_name == species)` error

Hi,
I've just got an "unused argument (.data$species_name == species)" error, and I don't know how to proceed. Is it a bug, or am I doing something wrong?

> library(msigdbr)
> msigdbr(species = "Homo sapiens")
Error in filter.tbl(msigdbr_orthologs, .data$species_name == species) : 
  unused argument (.data$species_name == species)
> msigdbr(species = "Mus musculus", category = "C2", subcategory = "CGP")
Error in filter.tbl(msigdbr_orthologs, .data$species_name == species) : 
  unused argument (.data$species_name == species)
> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS:   /opt/R-4.0.2/lib64/R/lib/libRblas.so
LAPACK: /opt/R-4.0.2/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
 [1] grid      stats4    parallel  stats     graphics  grDevices utils    
 [8] datasets  methods   base     

other attached packages:
 [1] msigdbr_7.2.1               DESeq2_1.28.1              
 [3] SummarizedExperiment_1.18.2 DelayedArray_0.14.1        
 [5] matrixStats_0.57.0          Biobase_2.48.0             
 [7] rtracklayer_1.48.0          genomation_1.20.0          
 [9] gProfileR_0.7.0             ChIPpeakAnno_3.22.4        
[11] Biostrings_2.56.0           XVector_0.28.0             
[13] VennDiagram_1.6.20          futile.logger_1.4.3        
[15] rGREAT_1.20.0               methylKit_1.14.2           
[17] GenomicRanges_1.40.0        GenomeInfoDb_1.24.2        
[19] IRanges_2.22.2              S4Vectors_0.26.1           
[21] BiocGenerics_0.34.0         gprofiler2_0.2.0           
[23] reshape2_1.4.4              ggplot2_3.3.2              
[25] gridExtra_2.3               data.table_1.13.0          
[27] biomaRt_2.44.4              igraph_1.2.6               
[29] STRINGdb_2.0.2             

loaded via a namespace (and not attached):
  [1] circlize_0.4.10          BiocFileCache_1.12.1     plyr_1.8.6              
  [4] lazyeval_0.2.2           splines_4.0.2            BiocParallel_1.22.0     
  [7] gridBase_0.4-7           digest_0.6.25            ensembldb_2.12.1        
 [10] htmltools_0.5.0          GO.db_3.11.4             magrittr_1.5            
 [13] memoise_1.1.0            BSgenome_1.56.0          limma_3.44.3            
 [16] annotate_1.66.0          readr_1.4.0              R.utils_2.10.1          
 [19] askpass_1.1              bdsmatrix_1.3-4          prettyunits_1.1.1       
 [22] colorspace_1.4-1         blob_1.2.1               rappdirs_0.3.1          
 [25] xfun_0.18                dplyr_1.0.2              crayon_1.3.4            
 [28] RCurl_1.98-1.2           jsonlite_1.7.1           graph_1.66.0            
 [31] genefilter_1.70.0        impute_1.62.0            survival_3.1-12         
 [34] glue_1.4.2               hash_2.2.6.1             gtable_0.3.0            
 [37] zlibbioc_1.34.0          seqinr_4.2-4             GetoptLong_1.0.3        
 [40] shape_1.4.5              scales_1.1.1             futile.options_1.0.1    
 [43] mvtnorm_1.1-1            DBI_1.1.0                Rcpp_1.0.5              
 [46] plotrix_3.7-8            xtable_1.8-4             viridisLite_0.3.0       
 [49] progress_1.2.2           emdbook_1.3.12           bit_4.0.4               
 [52] mclust_5.4.6             sqldf_0.4-11             htmlwidgets_1.5.2       
 [55] httr_1.4.2               gplots_3.1.0             RColorBrewer_1.1-2      
 [58] ellipsis_0.3.1           pkgconfig_2.0.3          XML_3.99-0.5            
 [61] R.methodsS3_1.8.1        farver_2.0.3             dbplyr_1.4.4            
 [64] locfit_1.5-9.4           tidyselect_1.1.0         labeling_0.3            
 [67] rlang_0.4.7              AnnotationDbi_1.50.3     munsell_0.5.0           
 [70] tools_4.0.2              gsubfn_0.7               generics_0.0.2          
 [73] RSQLite_2.2.1            ade4_1.7-15              fastseg_1.34.0          
 [76] evaluate_0.14            stringr_1.4.0            yaml_2.2.1              
 [79] knitr_1.30               bit64_4.0.5              caTools_1.18.0          
 [82] purrr_0.3.4              AnnotationFilter_1.12.0  RBGL_1.64.0             
 [85] formatR_1.7              R.oo_1.24.0              xml2_1.3.2              
 [88] compiler_4.0.2           rstudioapi_0.11          plotly_4.9.2.1          
 [91] curl_4.3                 png_0.1-7                geneplotter_1.66.0      
 [94] tibble_3.0.3             idr_1.2                  stringi_1.5.3           
 [97] GenomicFeatures_1.40.1   lattice_0.20-41          ProtGenerics_1.20.0     
[100] Matrix_1.2-18            multtest_2.44.0          vctrs_0.3.4             
[103] pillar_1.4.6             lifecycle_0.2.0          BiocManager_1.30.10     
[106] GlobalOptions_0.1.2      bitops_1.0-6             qvalue_2.20.0           
[109] R6_2.4.1                 KernSmooth_2.23-17       lambda.r_1.2.4          
[112] MASS_7.3-51.6            gtools_3.8.2             assertthat_0.2.1        
[115] chron_2.3-56             proto_1.0.0              openssl_1.4.3           
[118] rjson_0.2.20             withr_2.3.0              regioneR_1.20.1         
[121] GenomicAlignments_1.24.0 Rsamtools_2.4.0          GenomeInfoDbData_1.2.3  
[124] hms_0.5.3                tidyr_1.1.2              coda_0.19-4             
[127] rmarkdown_2.4            seqPattern_1.20.0        bbmle_1.0.23.1          
[130] numDeriv_2016.8-1.1      tinytex_0.26

Best,
Kasia

Inconsistent gene set contents with MSigDB

First, thanks for the great package! It's really convenient to be able to pull in these gene sets from MSigDB. I've been using it to pull gene sets for about a year now, and only recently noticed that some of the gene sets are different than what's on MSigDB (e.g., GOBP_Keratinization from msigdbr includes 279 genes, but on MSigDB it only has 83 genes).

I thought it might be a difference of versions (as msigdbr pulls MSigDB 7.5.1), but GOBP_Keratinization actually contains fewer genes in this version (n = 59): https://data.broadinstitute.org/gsea-msigdb/msigdb/release/7.5.1/c5.go.bp.v7.5.1.symbols.gmt

I used this line to pull all GO BP sets:

m_df_BP = msigdbr(species = "Homo sapiens",subcategory=c("BP"))

here is my session info:

R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] parallel stats4 stats graphics grDevices
[6] datasets utils methods base

other attached packages:
[1] scales_1.1.1 msigdbr_7.4.1
[3] biomartr_0.9.2 data.table_1.14.0
[5] GSEABase_1.54.0 graph_1.70.0
[7] annotate_1.70.0 XML_3.99-0.6
[9] reactome.db_1.76.0 GO.db_3.13.0
[11] fgsea_1.18.0 dplyr_1.0.7
[13] EnhancedVolcano_1.10.0 ggrepel_0.9.1
[15] rlist_0.4.6.1 pheatmap_1.0.12
[17] org.Hs.eg.db_3.13.0 AnnotationDbi_1.54.1
[19] readxl_1.3.1 ggplot2_3.3.5
[21] ashr_2.2-47 DESeq2_1.32.0
[23] SummarizedExperiment_1.22.0 Biobase_2.52.0
[25] MatrixGenerics_1.4.0 matrixStats_0.59.0
[27] GenomicRanges_1.44.0 GenomeInfoDb_1.28.1
[29] IRanges_2.26.0 S4Vectors_0.30.0
[31] BiocGenerics_0.38.0 rmarkdown_2.14
[33] here_1.0.1

loaded via a namespace (and not attached):
[1] snow_0.4-3 circlize_0.4.14
[3] fastmatch_1.1-3 BiocFileCache_2.0.0
[5] splines_4.1.0 BiocParallel_1.26.1
[7] digest_0.6.27 invgamma_1.1
[9] foreach_1.5.2 htmltools_0.5.2
[11] SQUAREM_2021.1 fansi_0.5.0
[13] magrittr_2.0.1 memoise_2.0.0
[15] cluster_2.1.2 doParallel_1.0.17
[17] ComplexHeatmap_2.8.0 Biostrings_2.60.1
[19] extrafont_0.17 extrafontdb_1.0
[21] prettyunits_1.1.1 colorspace_2.0-2
[23] rappdirs_0.3.3 blob_1.2.2
[25] xfun_0.30 crayon_1.4.1
[27] RCurl_1.98-1.3 genefilter_1.74.0
[29] survival_3.3-1 iterators_1.0.14
[31] glue_1.6.2 gtable_0.3.0
[33] zlibbioc_1.38.0 XVector_0.32.0
[35] GetoptLong_1.0.5 DelayedArray_0.18.0
[37] proj4_1.0-10.1 Rttf2pt1_1.3.9
[39] shape_1.4.6 maps_3.3.0
[41] DBI_1.1.1 Rcpp_1.0.7
[43] progress_1.2.2 xtable_1.8-4
[45] clue_0.3-60 bit_4.0.4
[47] truncnorm_1.0-8 httr_1.4.2
[49] RColorBrewer_1.1-2 ellipsis_0.3.2
[51] pkgconfig_2.0.3 farver_2.1.0
[53] dbplyr_2.1.1 locfit_1.5-9.4
[55] utf8_1.2.1 tidyselect_1.1.1
[57] labeling_0.4.2 rlang_0.4.11
[59] munsell_0.5.0 cellranger_1.1.0
[61] tools_4.1.0 cachem_1.0.5
[63] cli_3.3.0 generics_0.1.0
[65] RSQLite_2.2.7 evaluate_0.14
[67] stringr_1.4.0 fastmap_1.1.0
[69] yaml_2.2.1 babelgene_21.4
[71] knitr_1.33 bit64_4.0.5
[73] purrr_0.3.4 KEGGREST_1.32.0
[75] ash_1.0-15 ggrastr_0.2.3
[77] xml2_1.3.2 biomaRt_2.48.2
[79] compiler_4.1.0 rstudioapi_0.13
[81] filelock_1.0.2 curl_4.3.2
[83] beeswarm_0.4.0 png_0.1-8
[85] tibble_3.1.3 geneplotter_1.70.0
[87] stringi_1.7.3 highr_0.10
[89] ggalt_0.4.0 lattice_0.20-45
[91] Matrix_1.3-4 vctrs_0.3.8
[93] pillar_1.6.1 lifecycle_1.0.0
[95] BiocManager_1.30.16 GlobalOptions_0.1.2
[97] bitops_1.0-7 irlba_2.3.3
[99] R6_2.5.0 renv_0.15.4
[101] KernSmooth_2.23-20 gridExtra_2.3
[103] vipor_0.4.5 codetools_0.2-19
[105] MASS_7.3-55 assertthat_0.2.1
[107] rprojroot_2.0.2 rjson_0.2.21
[109] withr_2.4.2 GenomeInfoDbData_1.2.6
[111] hms_1.1.0 grid_4.1.0
[113] Cairo_1.5-12.2 mixsqp_0.3-43
[115] tinytex_0.37 ggbeeswarm_0.6.0

Problem with loading several categories

In our work we often want to test our gene lists against several categories of gene sets at once.
Until now we would load the gene sets like this:

msigdb.genes.sets <-msigdbr(species="Homo sapiens", category=c("H","C2"))

We noticed that in doing so, the gene sets are truncated, and the number of genes remaining in a gene set varies with the number of categories and their order.
After looking at the R code, it seems the problem is that the categories are filtered with "==" and not "%in%", which means we cannot pass a vector of categories in our command. But no warning or error is thrown and everything downstream works, with the background ratio values obviously wrong.

Would it be possible to correct this, or to forbid requesting more than one category in the command?
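
The behaviour described above can be reproduced in base R without the package. This toy sketch assumes nothing about the msigdbr internals beyond the "==" versus "%in%" distinction mentioned in the report.

```r
# Toy category column with six rows.
gs_cat <- c("H", "C2", "H", "C2", "C2", "H")
wanted <- c("H", "C2")

# "==" recycles `wanted` along gs_cat (H, C2, H, C2, H, C2), so rows only
# match when they happen to line up with the recycled pattern.
keep_eq <- gs_cat == wanted

# "%in%" tests each row for membership in `wanted`.
keep_in <- gs_cat %in% wanted

print(sum(keep_eq))
print(sum(keep_in))
```

Because the length of gs_cat is a multiple of the length of wanted, R recycles silently, which matches the observation that no warning appears while rows are dropped.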

Update to MSIGDB

Hello!

I was wondering if there are plans to synchronize msigdbr with the latest release of MSigDB (August 2019). The new MSigDB has added and removed hundreds of gene sets, so I've been finding that the information pages for most of my top GSEA hits using msigdbr annotations no longer exist.

Thank you for your time!
Best,
Henry

No gene sets from KEGG, REACTOME or BIOCARTA

It looks like it's no longer possible to get gene sets from KEGG, REACTOME or BIOCARTA:

c2_reactome <- msigdbr(category = "C2", subcategory = "REACTOME") %>%
  split(x = .$gene_symbol, f = .$gs_name)
> length(c2_reactome)
[1] 0

Can these be restored? Thank you.

Methodology details, and `write.gmt` helper functions?

Hi, I came across your package, which could potentially save me a lot of work, so thank you.

Could you publish the details of your method for converting gene sets from human to other species? I need this information in order to cite you in my research.

Also, would you consider adding helper functions to convert the data frames into a form that can easily be written as a .gmt pathway file?
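
A rough sketch of what such a helper might look like in base R. The function name write_gmt is hypothetical and not part of msigdbr. The column names gs_name and gene_symbol follow the printed output elsewhere on this page, and the description field is filled with the placeholder "NA".

```r
# Hypothetical helper: write a long-format gene set table as a GMT file,
# one line per set: name <TAB> description <TAB> gene1 <TAB> gene2 ...
write_gmt <- function(df, path, name_col = "gs_name", gene_col = "gene_symbol") {
  sets <- split(df[[gene_col]], df[[name_col]])
  lines <- vapply(names(sets), function(nm) {
    # "NA" is a placeholder for the description field.
    paste(c(nm, "NA", unique(sets[[nm]])), collapse = "\t")
  }, character(1))
  writeLines(lines, path)
  invisible(path)
}

# Toy input with a duplicated gene in SET_B.
toy_sets <- data.frame(
  gs_name     = c("SET_A", "SET_A", "SET_B", "SET_B", "SET_B"),
  gene_symbol = c("G1", "G2", "G2", "G3", "G3"),
  stringsAsFactors = FALSE
)

gmt_path <- tempfile(fileext = ".gmt")
write_gmt(toy_sets, gmt_path)
gmt_lines <- readLines(gmt_path)
print(gmt_lines)
```

Duplicated genes within a set are dropped with unique(), which seemed the safer default for a pathway file.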

Function to query MSigDB database version used by msigdbr

It would be great to have a function to query MSigDB database version used by msigdbr

CP:WIKIPATHWAY

How can I retrieve CP:WIKIPATHWAY? (https://www.gsea-msigdb.org/gsea/msigdb/genesets.jsp?collection=CP:WIKIPATHWAYS)

Add shorter GO descriptions?

The entries in the gs_description column for GO terms are rather long and not ideal for use as human-readable identifiers when plotting ORA or GSEA results. Would it be possible to add a gs_brief_description column that uses the names from the appropriate GO database release? I have been getting the data using the code below and then left-joining it to ORA and GSEA results tables made with fgsea. For other databases, I just use the entries in gs_description.

# install.packages(c("ontologyIndex", "dplyr"))
library(ontologyIndex)
library(dplyr)

# Brief GO term descriptions (use same data from MSigDB release notes)
file <- "http://release.geneontology.org/2021-12-15/ontology/go-basic.obo"
go_basic_list <- get_OBO(file,
                         propagate_relationships = "is_a",
                         extract_tags = "minimal")

# Convert to data.frame with fewer columns
go_basic_df <- as.data.frame(go_basic_list) %>%
  filter(!obsolete) %>%
  select(pathway = id, name)

> msigdbr(species = "Homo sapiens")
Error in `select()`:
! <text>:1:5: unexpected symbol
1: Use of
        ^
Run `rlang::last_error()` to see where the error occurred.
> rlang::last_error()
<simpleError in select(., .data$human_ensembl_gene, gene_symbol = .data$human_gene_symbol,     entrez_gene = .data$human_entrez_gene): <text>:1:5: unexpected symbol
1: Use of
        ^>

session info:

> sessionInfo()
R version 4.2.0 (2022-04-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] EnrichmentBrowser_2.26.0    graph_1.74.0               
 [3] SummarizedExperiment_1.26.1 Biobase_2.56.0             
 [5] GenomicRanges_1.48.0        GenomeInfoDb_1.32.4        
 [7] IRanges_2.30.1              S4Vectors_0.34.0           
 [9] BiocGenerics_0.42.0         MatrixGenerics_1.8.1       
[11] matrixStats_0.63.0          msigdbr_7.5.1              
[13] fgsea_1.22.0                biomaRt_2.52.0             
[15] dplyr_1.0.10                clusterProfiler_4.4.4

Any ideas...?