kaleidophon / deep-significance

Enabling easy statistical significance testing for deep neural networks.

Home Page: https://deep-significance.rtfd.io/en/latest/

License: GNU General Public License v3.0

Languages: Python 67.18%, Jupyter Notebook 32.65%, Shell 0.16%

Topics: significance-testing, deep-learning, dl, hypothesis-testing, hypothesis-tests, statistical-significance, statistical-significance-test, machine-learning, ml, deeplearning

deep-significance's Issues

Impact of sample size

Hi,

I'm doing some (very) small-scale experimentation with your package and there's something I'm not clear about. How do the sample sizes affect the statistical significance score (min_eps) returned by your function?
I don't mean that as a general question, but rather as one about the correct usage of your package. For example, why would aso([8], [5, 5, 8, 7, 8]) return 0.0858? Can I really conclude that algorithm 1 is better than algorithm 2 based on a single sample from algorithm 1's results?
Another example would be aso([10, 8], [5, 5, 8, 7, 8]) -> 0.0367. Again, I'm not convinced I can conclude that algorithm 1 is better than algorithm 2 based on such a small sample of results. I would expect to be "asked" to run more experiments to provide more results for the statistical test.

So in short, I'm asking whether your function takes the sample sizes into account when it calculates the significance score (min_eps). Am I missing something here? If I am, please feel free to correct me.

Any clarification would be appreciated, thanks!
Ran
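
One way to see why a sample of size one is problematic for any bootstrap-based procedure is to look at how much a resampled statistic can vary. The sketch below is not deepsig's code: `bootstrap_means` is a hypothetical helper using only the standard library, and it looks at the bootstrap mean rather than at min_eps. With a single observation, every resample is identical, so the bootstrap cannot express any uncertainty about that sample.

```python
import random
import statistics


def bootstrap_means(sample, num_iterations=1000, seed=0):
    """Return the means of `num_iterations` bootstrap resamples of `sample`."""
    rng = random.Random(seed)
    n = len(sample)
    return [
        statistics.fmean(rng.choices(sample, k=n))  # resample with replacement
        for _ in range(num_iterations)
    ]


# The three samples are the ones quoted in the question above.
for sample in ([8], [10, 8], [5, 5, 8, 7, 8]):
    spread = statistics.pstdev(bootstrap_means(sample))
    print(f"n={len(sample)}: bootstrap std of the mean = {spread:.3f}")
```

For `[8]` the spread is exactly zero, because each resample can only contain the value 8. This illustrates the general limitation only; it does not explain the specific min_eps values reported above.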

Create `deepsig.sample_size` module

Create function for bootstrap power analysis (see Henderson et al., 2018; Yuan & Hayashi, 2003)
Create function to compute reduction in uncertainty for violation ratio estimate (based on eq. 9 in del Barrio et al., 2018)
Implement unit tests
Update documentation
Publish new version 1.2.0

Sample-level random seed test

Hi,

First of all, thanks a lot for your work. It is exactly what I was looking for :)

I wondered if it is possible to compare multiple runs of the two models, A and B, at the sample level rather than at the score level. Let's say you trained each model five times with different random seeds. Does it make sense to test each run of A against each run of B and then average all the epsilons?
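
The procedure described in the question can be sketched as follows. This only shows the mechanics of "test every run of A against every run of B, then average"; whether the averaged value is statistically meaningful is exactly what the question asks and is not settled here. `mean_pairwise_score` and `toy_test` are hypothetical names, and `toy_test` is a stand-in for a real test such as `aso`, with made-up data.

```python
from itertools import product
from statistics import fmean


def mean_pairwise_score(runs_a, runs_b, test_fn):
    """Apply `test_fn` to every (run of A, run of B) pair and average the results."""
    scores = [test_fn(a, b) for a, b in product(runs_a, runs_b)]
    return fmean(scores)


def toy_test(scores_a, scores_b):
    """Illustrative stand-in: fraction of samples where B scores at least as high as A."""
    return sum(b >= a for a, b in zip(scores_a, scores_b)) / len(scores_a)


# Two runs (seeds) per model, three per-sample scores each; values are invented.
runs_a = [[0.9, 0.8, 0.7], [0.8, 0.8, 0.9]]
runs_b = [[0.6, 0.9, 0.5], [0.7, 0.7, 0.7]]

print(mean_pairwise_score(runs_a, runs_b, toy_test))
```

With five seeds per model this would run 25 tests, so the cost grows with the product of the number of runs.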

Holm-Bonferroni vs Bonferroni

Hello, I am not as familiar with statistical testing as I ought to be, but I found this package interesting. Is the Bonferroni statistic used here actually the Holm-Bonferroni statistic? The code matches the definition used by Wikipedia, as linked before.

deep-significance/deepsig/correction.py

Line 81 in 21f9251

p_partial_u = (N - u + 1) * sorted_p_values[u - 1]

Great package by the way!
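
For reference, the difference between the two corrections can be sketched in a few lines. This is a textbook-style illustration, not the code in `deepsig/correction.py`: the function names are hypothetical, and the running maximum in `holm_bonferroni` is the standard way to produce adjusted p-values, which the library may or may not apply. The `candidate` line has the same form as the line quoted above.

```python
def bonferroni(p_values):
    """Plain Bonferroni: multiply every p-value by the number of tests N (capped at 1)."""
    n = len(p_values)
    return [min(1.0, n * p) for p in p_values]


def holm_bonferroni(p_values):
    """Holm step-down: the u-th smallest p-value (1-indexed) is multiplied by
    (N - u + 1); a running maximum keeps the adjusted values monotone."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_max = 0.0
    for u, idx in enumerate(order, start=1):
        candidate = (n - u + 1) * p_values[idx]
        running_max = max(running_max, candidate)
        adjusted[idx] = min(1.0, running_max)
    return adjusted


p_values = [0.01, 0.04, 0.03]  # made-up p-values
print(bonferroni(p_values))
print(holm_bonferroni(p_values))
```

Only the smallest p-value is treated identically by both methods (multiplier N); Holm uses smaller multipliers for the rest, so it is never more conservative than plain Bonferroni.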

Source package on pypi missing requirements.txt

Doubt on how to use ASO

@Kaleidophon Good evening. Sorry, I'm not sure I understood how the ASO function works.
For example, if I run:
"
my_model_scores = scores_AUROC_Resnet
baseline_scores = scores_AUROC_Mobilenet

min_eps = aso(my_model_scores, baseline_scores, seed=seed, show_progress=False, confidence_level=0.95)"
then, from what I understood, min_eps should be the upper bound on the amount of violation of the stochastic order.
What I don't understand is how the samples F* and G* are extracted. The original paper says inverse transform sampling is used, while, as far as I understood, your paper in this repository states in (3) that these samples are obtained by bootstrapping. Do these mean the same thing? Are they drawn randomly via inverse transform sampling, or is a bootstrapping procedure used that involves a constant, similar to the one used in power analysis? Maybe I'm just confusing terms and they mean the same thing.
Thank you in advance and have a good evening.
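
On the terminology: as a general fact, independent of how deepsig implements it, inverse transform sampling from the *empirical* distribution of a sample is the same operation as drawing from that sample with replacement, which is what a bootstrap resample is. The sketch below shows this with hypothetical helper names and made-up scores; it is not taken from the library.

```python
import math
import random


def empirical_quantile(sorted_sample, u):
    """Empirical quantile function F^{-1}(u): the smallest observation whose
    empirical CDF value is at least u."""
    n = len(sorted_sample)
    index = max(math.ceil(u * n) - 1, 0)  # guard against u == 0.0
    return sorted_sample[index]


def inverse_transform_resample(sample, rng):
    """Draw len(sample) values by pushing uniform draws through F^{-1}."""
    sorted_sample = sorted(sample)
    return [empirical_quantile(sorted_sample, rng.random()) for _ in sample]


rng = random.Random(0)
scores = [0.71, 0.64, 0.80, 0.75, 0.69]  # invented AUROC-like scores
resample = inverse_transform_resample(scores, rng)
print(resample)
```

Because each uniform draw falls into one of n equally wide intervals, every observation is selected with probability 1/n, which matches sampling with replacement.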

Misaligned diagonals for DataFrame

Hi,

Love this repo, thanks for doing this!

Issue

I have a small issue with the new DataFrame feature that you implemented. The diagonal of the result is misaligned: for one model, the entry comparing the model with itself is close to stochastic dominance.

Reproduce Issue

I have the following dictionary:
d = {'x': array([59.13, 58.03, 59.18, 58.78, 58.5 ]), 'y': array([58.13, 59.19, 59.94, 60.08, 59.85]), 'z': array([58.77, 58.86, 59.58, 59.59, 59.64]), 'w': array([58.16, 58.49, 59.87, 58.94, 58.96])}

I use the following line of code:
print(multi_aso(d, confidence_level=0.05, return_df=True))

I get the following result:

          x         y         z    w
x  1.000000  1.000000  0.202027  0.0
y  1.000000  0.101093  0.000000  0.0
z  0.202027  0.000000  1.000000  0.0
w  0.000000  0.000000  0.000000  1.0

I think the diagonal entry for the (y, y) pair is incorrect.

Thanks for reading!
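
A quick way to flag this kind of problem is to check the diagonal of the result directly. The sketch below copies the values from the table above into a plain nested list and assumes that every self-comparison should be 1.0, as it is for x, z, and w; `unexpected_diagonal` is a hypothetical helper, not part of deepsig.

```python
labels = ["x", "y", "z", "w"]

# Values copied from the reported result, row by row.
result = [
    [1.000000, 1.000000, 0.202027, 0.0],
    [1.000000, 0.101093, 0.000000, 0.0],
    [0.202027, 0.000000, 1.000000, 0.0],
    [0.000000, 0.000000, 0.000000, 1.0],
]


def unexpected_diagonal(matrix, names, expected=1.0, tol=1e-9):
    """Return the names whose self-comparison differs from `expected`."""
    return [
        name
        for i, name in enumerate(names)
        if abs(matrix[i][i] - expected) > tol
    ]


print(unexpected_diagonal(result, labels))
```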
