Hi, I'm doing some (very) small scale experimentation with your pack


Impact of sample size about deep-significance HOT 5 CLOSED

kaleidophon commented on May 29, 2024

Impact of sample size

from deep-significance.

Comments (5)

ranshadmi commented on May 29, 2024 1

Are you familiar with statsmodels.stats.power.TTestIndPower.solve_power()?
I realize that it is probably just a fancy wrapper around a simple t-test, but given effect_size, alpha, power and the ratio between the two group sizes, it outputs the "required" number of samples. That's what I'm using now, and I was wondering how I can replace it with your function.
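For readers who have not used it, the kind of calculation described above can be approximated with the standard library alone. This is only a sketch: it uses the normal approximation to the two-sample t-test instead of the noncentral t distribution that statsmodels works with, so it tends to return a slightly smaller number than solve_power() would. The function name approx_nobs1 is hypothetical, and ratio is assumed to mean nobs2 / nobs1.

```python
from math import ceil
from statistics import NormalDist


def approx_nobs1(effect_size, alpha=0.05, power=0.8, ratio=1.0):
    """Approximate size of group 1 for a two-sided, two-sample test.

    Normal approximation only; `ratio` is assumed to be nobs2 / nobs1.
    """
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = standard_normal.inv_cdf(power)          # quantile for the target power
    # The variance of the standardized mean difference is (1 + 1/ratio) / nobs1.
    return (1 + 1 / ratio) * ((z_alpha + z_power) / effect_size) ** 2


n = approx_nobs1(effect_size=0.5, alpha=0.05, power=0.8, ratio=1.0)
print(n, ceil(n))  # about 62.8, i.e. 63 scores per group after rounding up
```

Smaller effect sizes or stricter alpha values push the required number up quickly, which is why very small samples rarely reach the usual power targets.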


Kaleidophon commented on May 29, 2024

Hey @ranshadmi!

Thank you for your interest. Let me try to explain the relationship between sample size and the result better.
First of all, the results of the test are non-deterministic, since they are based on a bootstrapping procedure. When you are playing around with the test, you can use the seed argument to fix the randomness.

Secondly, the final test score depends on two different parts: the extent of the violation of the stochastic order based on the original two samples of scores, and an adjustment based on bootstrapped score samples (see eq. 3 in the paper). The violation of the stochastic order is calculated from the cumulative distribution functions (CDFs) of the two distributions of model scores. Since we do not have access to the true distributions, the empirical CDFs are used (see implementation here). When you only have a single score for one of the samples, the empirical CDF for that algorithm will essentially be a step function with a single step - you can imagine that this will not be very informative for the test. In addition, the second term includes a correction term for the sample sizes, which can compensate for this lack to some degree, and another variance term based on bootstrapped score samples. But since bootstrapping a single score will always lead to the same result, this will also not be very informative.
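To make the step-function point concrete, here is a minimal sketch of an empirical CDF in plain Python. It is not the package's implementation; the helper name empirical_cdf and the score lists are assumptions chosen for illustration.

```python
from bisect import bisect_right


def empirical_cdf(scores):
    """Return a function x -> fraction of scores that are <= x."""
    ordered = sorted(scores)
    n = len(ordered)

    def cdf(x):
        # bisect_right counts how many sorted scores are <= x.
        return bisect_right(ordered, x) / n

    return cdf


single = empirical_cdf([8])     # one score: a single jump from 0 to 1 at x = 8
pair = empirical_cdf([8, 10])   # two scores: jumps of 0.5 at x = 8 and x = 10

for x in (7, 8, 9, 10):
    print(x, single(x), pair(x))
# 7 0.0 0.0
# 8 1.0 0.5
# 9 1.0 0.5
# 10 1.0 1.0
```

With one score the curve carries no information about spread, and every bootstrap resample of that sample reproduces the same curve.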

The bootstrapping is used in order to produce an upper bound to the test result. In that sense, the difference between your examples with [8] and [8, 10] is expected: Adding 10 does decrease the violation of the stochastic order, and adding another score sample makes the upper bound tighter. However, you are correct that this sample size is very low, and any conclusions from any statistical test with such a low sample size should be drawn with the appropriate grain of salt.

For this purpose, the package supplies two more functions: With aso_uncertainty_reduction(), you can compute the factor by which the uncertainty about the true test result decreases when you add more scores. With bootstrap_power_analysis(), you can determine the statistical power of your sample, i.e. one minus the probability of a Type II error (a false negative). Ideally, the power should be around 0.8. These tools are meant to help you decide how many scores to collect, but the rule of thumb always remains: The more, the better.
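The general idea behind a bootstrap power analysis can be sketched in plain Python. This is an illustration under stated assumptions, not the code behind bootstrap_power_analysis(): the function names, the scaling factor of 1.25, the iteration counts and the one-sided permutation test are all choices made for this example.

```python
import random


def _mean(values):
    return sum(values) / len(values)


def permutation_p_value(a, b, rng, num_permutations=200):
    """One-sided permutation test for mean(a) > mean(b)."""
    observed = _mean(a) - _mean(b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(num_permutations):
        rng.shuffle(pooled)
        diff = _mean(pooled[:len(a)]) - _mean(pooled[len(a):])
        if diff >= observed:
            hits += 1
    # Add-one correction so the p-value is never exactly zero.
    return (hits + 1) / (num_permutations + 1)


def bootstrap_power(sample, scalar=1.25, alpha=0.05, num_bootstrap=200, seed=None):
    """Fraction of bootstrap rounds in which a scaled copy of the sample
    is detected as significantly better than the sample itself."""
    rng = random.Random(seed)  # fixing the seed makes the result repeatable
    scaled = [scalar * score for score in sample]
    rejections = 0
    for _ in range(num_bootstrap):
        boot_a = rng.choices(scaled, k=len(sample))  # resample with replacement
        boot_b = rng.choices(sample, k=len(sample))
        if permutation_p_value(boot_a, boot_b, rng) < alpha:
            rejections += 1
    return rejections / num_bootstrap


ten_scores = [8.0, 8.5, 9.0, 9.5, 10.0, 8.2, 9.1, 9.7, 8.8, 9.3]
print(bootstrap_power(ten_scores, seed=0))  # ten scores
print(bootstrap_power([8.0], seed=0))       # a single score
```

A single score cannot yield a small p-value in this setup, because swapping the two resampled values is as extreme as the observed difference about half the time, so the estimated power stays at zero no matter how large the scaling factor is.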

Hope that helped and let me know if you have any further questions!


ranshadmi commented on May 29, 2024

Thanks for your elaborate reply!
I still have some questions...

As for aso_uncertainty_reduction() - how should I interpret the returned number? What does an "uncertainty reduction" of 1.1547005383792515 actually mean?
Should I just try a few pairs of values until I see some kind of saturation in the returned value? For example, 1 (old) -> 3 (new) will return a large number, while 3 -> 5 will return a small number - can I conclude from that that 3 is a good-enough number?


Kaleidophon commented on May 29, 2024

No problem! The number you get from aso_uncertainty_reduction() is the factor by which the amount of uncertainty around the estimate of the degree of violation of the stochastic order decreases as the sample size grows. This is based on Theorem 2.4 by del Barrio et al. (2017). You can see that the estimate of the violation (the term with F_n, G_m) approaches the true value (e_W2(F, G)) at a rate of sqrt(mn/(m + n)). Thus, adding more samples has diminishing returns. However, keep in mind that these are relative numbers - there is no (or at least, I don't know of a) good way to estimate by how much the current estimate is off. Thus, the function is supposed to help you decide whether adding more scores of algorithm A or B is more useful in reducing the uncertainty and making the test more accurate - "the more samples, the better" still holds.
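As a rough illustration of that rate, the ratio of sqrt(mn/(m + n)) after and before adding scores can be computed directly. The sketch below assumes this ratio is what the returned factor represents; the function names and sample sizes are hypothetical and not taken from the package or from this thread.

```python
from math import sqrt


def convergence_rate(m, n):
    """Rate sqrt(mn / (m + n)) for sample sizes m (algorithm A) and n (algorithm B)."""
    return sqrt(m * n / (m + n))


def uncertainty_reduction(m_old, n_old, m_new, n_new):
    """Ratio of the rate with the new sample sizes to the rate with the old ones."""
    return convergence_rate(m_new, n_new) / convergence_rate(m_old, n_old)


# Doubling the scores of A from 5 to 10 while B stays at 5 scores:
print(uncertainty_reduction(m_old=5, n_old=5, m_new=10, n_new=5))  # about 1.1547

# Diminishing returns with B fixed at 5 scores:
print(uncertainty_reduction(m_old=1, n_old=5, m_new=3, n_new=5))   # about 1.5
print(uncertainty_reduction(m_old=3, n_old=5, m_new=5, n_new=5))   # about 1.1547
```

The first call gives sqrt(4/3), which matches the value quoted in the question. A factor close to 1 only says that the next few scores help less than the previous ones did; it says nothing about how far off the current estimate is.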


Kaleidophon commented on May 29, 2024

No, it looks really cool! I don't think you really need to replace that with mine; yours seems more versatile. Mine just computes the power, and you can use an arbitrary test to do that, but that is pretty much it compared to the one you provided. Thanks for sharing!

