Comments (2)
First, if you are obtaining such high p-values for clearly distinct distributions, there may be a bug in the code, or you may be calling the method with the wrong parameters, because that should not happen. Can you provide an example of how you are using the method?
As for the explanation and understanding, the complete procedure is explained in the original article of Székely and Rizzo.
I will summarize the method:
- The null hypothesis is that the two samples have the same distribution. The alternative hypothesis is that the distribution is different (it does not matter how).
- In the article they prove that the expected energy statistic (`energy_test_statistic` in the code) between two samples converges if the samples have the same distribution, but tends to infinity (as the sample sizes grow) if they have different distributions.
- So, we will reject the null hypothesis if the energy statistic is "too high". But how do we measure whether it is "too high"? Because our samples have a finite size, the statistic will not be near infinity.
- Here is where we use the idea of a permutation test. Essentially, under the null hypothesis, all the observations come from the same distribution. Thus, if we permute the observations, so that now some observations may switch to a different sample, under the null hypothesis, the energy statistic would be similar to the original one: there is no reason for the original one to be special.
- Under the alternative hypothesis, however, the samples obtained from the permutation come from a common distribution, which is a mixture of the original distributions of each sample. In contrast, when we computed the original statistic, each sample had a different distribution. Thus, the original statistic is expected to be larger in this case than the statistics obtained from the permutations.
- Thus, we can perform a lot of random permutations (the number of permutations is the parameter `num_resamples`). We then compare the statistics obtained with the original one, obtaining the proportion of statistics larger than the original. This proportion is the estimated p-value.
- Under the alternative hypothesis, this p-value should be very small, as the statistic should be more extreme for the original data. Under the null hypothesis, the original statistic is not special in any way, so this p-value would be distributed uniformly between 0 and 1. The probability that this p-value is less than α is therefore (approximately) α. Thus, if we reject the null hypothesis when the p-value is less than 0.05, we will wrongly reject it about one time in 20.
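The steps above can be sketched in a few lines of NumPy. This is only an illustration of the idea, not the code used in dcor: the function names `energy_statistic` and `permutation_p_value` are made up for this example, the statistic uses the usual form n·m/(n+m) · (2·E|X−Y| − E|X−X'| − E|Y−Y'|), and the "+1" in the p-value is a common convention that I am assuming here so that the estimate is never exactly zero.

```python
import numpy as np


def energy_statistic(x, y):
    """Energy two-sample statistic for samples of shape (n, d) and (m, d)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, m = len(x), len(y)

    def mean_dist(a, b):
        # Mean Euclidean distance over all pairs (one row of a, one row of b).
        diff = a[:, None, :] - b[None, :, :]
        return np.sqrt((diff ** 2).sum(axis=-1)).mean()

    energy_distance = 2 * mean_dist(x, y) - mean_dist(x, x) - mean_dist(y, y)
    return n * m / (n + m) * energy_distance


def permutation_p_value(x, y, num_resamples=200, seed=0):
    """Estimate the p-value of the energy test by random permutations."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    pooled = np.concatenate([x, y])
    n = len(x)

    observed = energy_statistic(x, y)
    at_least_as_large = 0
    for _ in range(num_resamples):
        # Shuffle the pooled observations and split them into two new samples
        # with the same sizes as the original ones.
        shuffled = rng.permutation(pooled)
        if energy_statistic(shuffled[:n], shuffled[n:]) >= observed:
            at_least_as_large += 1

    # Assumed convention: count the original arrangement as one permutation.
    return (at_least_as_large + 1) / (num_resamples + 1)


if __name__ == "__main__":
    data_rng = np.random.default_rng(1)
    a = data_rng.normal(size=(50, 1))
    b = data_rng.normal(size=(50, 1))          # same distribution as a
    c = data_rng.normal(loc=3.0, size=(50, 1))  # clearly shifted
    print("same distribution:     ", permutation_p_value(a, b))
    print("different distribution:", permutation_p_value(a, c))
```

With clearly different samples, almost no permutation should reach the original statistic, so the estimated p-value should be close to its minimum of 1/(`num_resamples` + 1). If you get a large p-value in such a case, that points to a problem in how the function is being called.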
from dcor.
I will close this as there is no answer from @srujan741.