Giter Site home page Giter Site logo

Comments (3)

JEFworks avatar JEFworks commented on August 17, 2024

Dear Teng,

Thanks for sharing your output and the well organized issue report.

The differences you see may be attributable to the number of permutations being done to estimate the empirical p-value in the gsea vs. bulk.gsea functions. We can show how the number of permutations relates to the estimated the empirical p-value by explicitly setting the number of permutations in the following example:

library(liger)
# load gene set
data("org.Hs.GO2Symbol.list")  
# get universe
universe <- unique(unlist(org.Hs.GO2Symbol.list))
# get a gene set
gs <- org.Hs.GO2Symbol.list[[1]]
# fake dummy example where everything in gene set is perfectly enriched
vals <- rnorm(length(universe), 0, 10)
names(vals) <- universe
vals[gs] <- rnorm(length(gs), 100, 10)
head(vals)  # look at vals

gsea(values=vals, geneset=gs, mc.cores=1, plot=TRUE, n.rand=1000)
bulk.gsea(values=vals, set.list=list(gs), mc.cores=1, n.rand=1000)$p.val
gsea(values=vals, geneset=gs, mc.cores=1, plot=TRUE, n.rand=10000)
bulk.gsea(values=vals, set.list=list(gs), mc.cores=1, n.rand=10000)$p.val

Note that with 1000 permutations, the best we can estimate the true p-value is < 1/1000

> gsea(values=vals, geneset=gs, mc.cores=1, plot=TRUE, n.rand=1000)
[1] 0.001
> bulk.gsea(values=vals, set.list=list(gs), mc.cores=1, n.rand=1000)$p.val
[1] 0.000999001

With 10,000 permutations, we can get more precise to < 1/10,000.

> gsea(values=vals, geneset=gs, mc.cores=1, plot=TRUE, n.rand=10000)
[1] 1e-04
> bulk.gsea(values=vals, set.list=list(gs), mc.cores=1, n.rand=10000)$p.val
[1] 9.999e-05

Hope this helps!

Stay healthy and safe,
Jean

from liger.

teng-gao avatar teng-gao commented on August 17, 2024

Hi Jean,

Thanks for the detailed response!

I actually tried to set n.rand=1e4 and used the same random seed for both bulk and single gene set testing, however the discrepancy seemed to remain. I'm attaching the gene sets and logFC values below in case that is helpful to reproduce the issue. Interestingly, it also seems that the p values from bulk testing are consistently lower.

geneset_and_logfc.zip

from liger.

JEFworks avatar JEFworks commented on August 17, 2024

Dear Teng,

Thanks for sharing the geneset and logFC values you were using; very helpful in diagnostics.

I've compared the p-values from bulk.gsea, iterative.bulk.gsea and gsea and have found that if, rank=TRUE, then all 3 give the same results as expected:

> results <- iterative.bulk.gsea(logFC, set.list=geneset, rank=TRUE)
initial: [1e+02 - 20] [1e+03 - 8] [1e+04 - 5] done
> sig <- rownames(results)[results$q.val < 0.05]
> results[sig,]
                               p.val       q.val    sscore         edge
ANDROGEN_RESPONSE         0.00379962 0.031663500 -1.104368  0.009336836
INTERFERON_ALPHA_RESPONSE 0.00009999 0.002499750 -1.448063 -0.173254577
INTERFERON_GAMMA_RESPONSE 0.00069993 0.006999300 -1.254465 -0.030810312
MTORC1_SIGNALING          0.00689931 0.043120688  1.034815 -0.280169781
MYC_TARGETS_V1            0.00009999 0.002499750  1.775444 -0.264572926
OXIDATIVE_PHOSPHORYLATION 0.00029997 0.004999500  1.348529 -0.280169781
PROTEIN_SECRETION         0.00049995 0.006249375 -1.314850  0.015286103
UNFOLDED_PROTEIN_RESPONSE 0.00569943 0.040710215  1.065328 -0.269198663
> results.test <- bulk.gsea(logFC, set.list=geneset[sig], n.rand=1e4, rank=TRUE)
> results.test
                               p.val       q.val    sscore         edge
ANDROGEN_RESPONSE         0.00379962 0.006400000 -1.104368  0.009336836
INTERFERON_ALPHA_RESPONSE 0.00009999 0.001100000 -1.448063 -0.173254577
INTERFERON_GAMMA_RESPONSE 0.00069993 0.001966667 -1.254465 -0.030810312
MTORC1_SIGNALING          0.00689931 0.013975000  1.034815 -0.280169781
MYC_TARGETS_V1            0.00009999 0.000100000  1.775444 -0.264572926
OXIDATIVE_PHOSPHORYLATION 0.00029997 0.001550000  1.348529 -0.280169781
PROTEIN_SECRETION         0.00049995 0.001450000 -1.314850  0.015286103
UNFOLDED_PROTEIN_RESPONSE 0.00569943 0.013975000  1.065328 -0.269198663
> results.ind <- do.call(rbind, lapply(sig, function(gs) {
+   gsea(values = logFC, geneset = geneset[[gs]], main=gs, n.rand=1e4, return.details=TRUE, rank=TRUE)
+ }))
> rownames(results.ind) <- sig
> results.ind[, c(1,2,4,3)]
                           p.val edge.score scaled.score   edge.value
ANDROGEN_RESPONSE         0.0038  -2185.562     1.104368  0.009336836
INTERFERON_ALPHA_RESPONSE 0.0001  -2797.468     1.448063 -0.173254577
INTERFERON_GAMMA_RESPONSE 0.0007  -1794.328     1.254465 -0.030810312
MTORC1_SIGNALING          0.0069   1363.912     1.034815 -0.280169781
MYC_TARGETS_V1            0.0001   2248.408     1.775444 -0.264572926
OXIDATIVE_PHOSPHORYLATION 0.0003   1673.174     1.348529 -0.280169781
PROTEIN_SECRETION         0.0005  -2448.887     1.314850  0.015286103
UNFOLDED_PROTEIN_RESPONSE 0.0057   1794.249     1.065328 -0.269198663

However, when rank=FALSE, and the actual inputted values are used, the gsea function does indeed give the incorrect p-value (due to a discrepancy in how the scaled score is calculated), resulting in difference you saw:

> results <- iterative.bulk.gsea(logFC, set.list=geneset, rank=FALSE)
initial: [1e+02 - 21] [1e+03 - 10] [1e+04 - 5] done
> sig <- rownames(results)[results$q.val < 0.05]
> results[sig,]
                               p.val      q.val    sscore       edge
ADIPOGENESIS              0.00019998 0.00333300  1.346984 -0.7797408
INTERFERON_ALPHA_RESPONSE 0.00139986 0.01166550 -1.189337 -0.4057007
INTERFERON_GAMMA_RESPONSE 0.00079992 0.00799920 -1.281145 -0.2995351
MTORC1_SIGNALING          0.00569943 0.04071021  1.069090 -0.6127451
MYC_TARGETS_V1            0.00009999 0.00249975  1.497201 -0.9431348
OXIDATIVE_PHOSPHORYLATION 0.00009999 0.00249975 -1.746883  0.5227934
PROTEIN_SECRETION         0.00059994 0.00749925 -1.285763  0.2966188
> results.test <- bulk.gsea(logFC, set.list=geneset[sig], n.rand=1e4, rank=FALSE)
> results.test
                               p.val    q.val    sscore       edge
ADIPOGENESIS              0.00019998 0.000750  1.346984 -0.7797408
INTERFERON_ALPHA_RESPONSE 0.00139986 0.002850 -1.189337 -0.4057007
INTERFERON_GAMMA_RESPONSE 0.00079992 0.001700 -1.281145 -0.2995351
MTORC1_SIGNALING          0.00569943 0.008875  1.069090 -0.6127451
MYC_TARGETS_V1            0.00009999 0.000300  1.497201 -0.9431348
OXIDATIVE_PHOSPHORYLATION 0.00009999 0.000000 -1.746883  0.5227934
PROTEIN_SECRETION         0.00059994 0.001700 -1.285763  0.2966188
> results.ind <- do.call(rbind, lapply(sig, function(gs) {
+   gsea(values = logFC, geneset = geneset[[gs]], main=gs, n.rand=1e4, return.details=TRUE, rank=FALSE)
+ }))
> rownames(results.ind) <- sig
> results.ind[, c(1,2,4,3)]
                           p.val edge.score scaled.score edge.value
ADIPOGENESIS              0.0010   2494.709     1.242194 -0.7797408
INTERFERON_ALPHA_RESPONSE 0.0034  -3127.154     1.119215 -0.4057007
INTERFERON_GAMMA_RESPONSE 0.0015  -2463.756     1.183261 -0.2995351
MTORC1_SIGNALING          0.0039   2099.706     1.105981 -0.6127451
MYC_TARGETS_V1            0.0001   2674.405     1.455259 -0.9431348
OXIDATIVE_PHOSPHORYLATION 0.0001  -2919.925     1.581537  0.5227934
PROTEIN_SECRETION         0.0010  -3142.772     1.195370  0.2966188

Thanks for the catch. For now, please ignore the visualized p-values from the gsea function when rank=FALSE.

Generally, since you are testing many gene sets, if you intend to display these results, I would actually strongly advise displaying the multiple-testing corrected q-values from bulk testing instead of raw p-values.

Thanks again.

Stay healthy and safe,
Jean

from liger.

Related Issues (5)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    πŸ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. πŸ“ŠπŸ“ˆπŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❀️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.