vnmabus / dcor Goto Github PK

View Code? Open in Web Editor NEW

134.0 134.0 25.0 409 KB

Distance correlation and related E-statistics in Python

Home Page: https://dcor.readthedocs.io

License: MIT License

Python 100.00%

distance-correlation python python2 python3 statistics

dcor's People

Contributors

Stargazers

Watchers

dcor's Issues

Implementation for custom distances?

Hi, very nice package!

Is it possible to use it with a distance correlation coming from a custom notion of distance between Xi/Xj or Yi/Yj (i.e. not necessarily an exponent of the Euclidean distance?). If not, would you know of an implementation available to do so?

Thank you!

Wrong formulas in documentation

On https://dcor.readthedocs.io/en/latest/theory.html, SQRTs are missing for the denominators in the R^2 formulas.

Implement energy distance in terms of distance covariance

We can implement energy distance in terms of distance covariance, as shown in https://arxiv.org/pdf/1910.08883.pdf.

We need to study:

How this affect the current parameters of energy distance.
How to allow users to optionally access the different implementations of distance covariance, as well as the old energy distance implementation (if needed).

Seemingly incorrect results with `int` datatype

While experimenting with this package, I encountered a strange issue and thought it would be useful to post about it here. In short, it appears that the distance_correlation computation for int dtypes is incorrect when the size of the data is sufficiently large.

Here is a minimal example that can be used to replicate the issue:

import numpy as np
from dcor import distance_correlation

def reproduce_error(n):
    # some simple data
    arr1 = np.array([
        1, 2, 3
    ]*n)
    arr2 = np.array([
        10, 20, 5
    ]*n)
    
    int_int = distance_correlation(
        arr1, arr2
    )
    float_int = distance_correlation(
        arr1.astype(float),
        arr2
    )
    int_float = distance_correlation(
        arr1,
        arr2.astype(float),
    )
    float_float = distance_correlation(
        arr1.astype(float),
        arr2.astype(float),
    )

    print(f"""
    n: {n}
    int vs int: {int_int}
    float vs int: {float_int}
    int vs float: {int_float}
    float vs float: {float_float}
    """)

Now when we run this code for small samples, the correlations for all dtypes agree, and do not substantially change with the sample size.

reproduce_error(1)
    n: 1
    int vs int: 0.7621991222319221
    float vs int: 0.7621991222319221
    int vs float: 0.7621991222319221
    float vs float: 0.7621991222319221

reproduce_error(10)
    n: 10
    int vs int: 0.7621991222319219
    float vs int: 0.7621991222319219
    int vs float: 0.7621991222319219
    float vs float: 0.7621991222319217

reproduce_error(100)
    n: 100
    int vs int: 0.7621991222319221
    float vs int: 0.7621991222319221
    int vs float: 0.7621991222319221
    float vs float: 0.7621991222319215

However, past a certain point, the computations diverge:

reproduce_error(10000)
    n: 10000
    int vs int: 0.890284163962155
    float vs int: 0.890284163962155
    int vs float: 0.7621991222319217
    float vs float: 0.7621991222317823

I've started casting everything to float before computing the correlations to avoid this issue.

Clarification of distance correlation - dcor vs scipy

Hi!

I have started using dcor as as I need to find pairwise correlations between two variables/vectors for every pairwise comparison in a dataframe. I am using the distance correlation as i need to find correlations not just for linear pairwise correlations but also non-linear correlations.

Having read the documentation, I know this is the correct implementation for this purpose, however, as I understand it, Scipy also provides a distance correlation function. I am getting different results when using both dcor and scipy and was wondering if you could explain why? I am unsure if Scipy is actually using the same distance correlation, or if their implementation contains something obvious I have missed which leads to the different results:

from scipy.spatial import distance
distance.correlation(data['column1'], data['column2'])
= 0.57

import dcor
dcor.distance_correlation(data['column1'], data['column2'])
= 0.41

There is a large discrepancy here and would appreciate clarification!

thank you!

Causes segmentation fault

In conda-based virtualenv python 3.7.2 importing dcor into project
causes crash with segmentation fault error for me.

OSError: [Errno 36] File name too long when importing dcor

Importing dcor failed due to file name too long.

Ubuntu 20.04
python 3.8.10
dcor 0.5.3
numba 0.53.1 (+ 0.54.1)

>>> import dcor
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/quentin/.local/lib/python3.8/site-packages/dcor/__init__.py", line 14, in <module>
    from . import independence  # noqa
  File "/home/quentin/.local/lib/python3.8/site-packages/dcor/independence.py", line 11, in <module>
    from ._dcor import u_distance_correlation_sqr
  File "/home/quentin/.local/lib/python3.8/site-packages/dcor/_dcor.py", line 26, in <module>
    from ._fast_dcov_mergesort import _distance_covariance_sqr_mergesort_generic
  File "/home/quentin/.local/lib/python3.8/site-packages/dcor/_fast_dcov_mergesort.py", line 208, in <module>
    _distance_covariance_sqr_mergesort_generic_impl_compiled = numba.njit(
  File "/home/quentin/.local/lib/python3.8/site-packages/numba/core/decorators.py", line 221, in wrapper
    disp.compile(sig)
  File "/home/quentin/.local/lib/python3.8/site-packages/numba/core/dispatcher.py", line 891, in compile
    cres = self._cache.load_overload(sig, self.targetctx)
  File "/home/quentin/.local/lib/python3.8/site-packages/numba/core/caching.py", line 644, in load_overload
    return self._load_overload(sig, target_context)
  File "/home/quentin/.local/lib/python3.8/site-packages/numba/core/caching.py", line 651, in _load_overload
    data = self._cache_file.load(key)
  File "/home/quentin/.local/lib/python3.8/site-packages/numba/core/caching.py", line 495, in load
    overloads = self._load_index()
  File "/home/quentin/.local/lib/python3.8/site-packages/numba/core/caching.py", line 511, in _load_index
    with open(self._index_path, "rb") as f:
OSError: [Errno 36] File name too long: '/home/quentin/.local/lib/python3.8/site-packages/dcor/__pycache__/_fast_dcov_mergesort._generate_distance_covariance_sqr_mergesort_generic_impl.locals._distance_covariance_sqr_mergesort_generic_impl-163.py38.nbi'

Adding support for python 3.7

It seems that the latest version of dcor does not support py37, and the latest version for py37 is 0.5.3.

Accelerate distance correlation and stats using rowwise

When the naive algorithm is not used, the computation of the distance stats (and thus, distance correlation), can be accelerated using rowwise to compute the distance covariance and distance variances in parallel whenever possible.

Question: is there a fast method for `dcor.independence.distance_covariance_test`

WIth reference to the exampel in this notebook, this weekend I compared the performance of the the MERGESORT method vs. the NAIVE with a toy dataset of 8 columns x 21 rows:

%%timeit
dc = np.apply_along_axis(lambda col1: np.apply_along_axis(lambda col2: dcor.distance_correlation(col1, 
                                                          col2, method = 'NAIVE'), axis = 0, arr=data), axis =0, arr=data)
>>> 24.3 ms ± 334 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

vs:

%%timeit
dc = np.apply_along_axis(lambda col1: np.apply_along_axis(lambda col2: dcor.distance_correlation(col1, 
                                                          col2, method = 'MERGESORT'), axis = 0, arr=data), axis =0, arr=data)
>>> 17.4 ms ± 143 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Since i sometimes work with many thousands of rows, and possibly more columns, I wonder if there is a way to similarly improve the speed of the pairwise p-value calculation:

p = np.apply_along_axis(lambda col1: np.apply_along_axis(lambda col2: dcor.independence.distance_covariance_test(col1, 
                                                         col2, exponent=1.0, num_resamples=2000)[0], 
                                                         axis = 0, arr=data), axis =0, arr=data)
>>> 4.38 s ± 119 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Question about the shape of the input array

I have created an autoencoder as a feature extractor. To make the output of the encoder as independent as possible from the input. I chose dcor as an additional loss to train the autoencoder. However, I have some problems when calculating the dcor loss.
The input shape of the batch data is [32, 1, 28, 14] and the embedding shape is [32, 8, 14, 7]. But it seems that the calculation cannot be performed directly. The Error part is as follows:

I wonder if there have some way to calculate the dcor loss of the input and the embedding. Hope to get your answer :)

Can distance correlation-based t test is theoretically correct to implement for "uni"-dimensional data?

The whole t-test is based on the idea that the dimensions of X and Y must be high. What do you think about that?

Exponent implementation question

Hi again. I wondered if you could explain the logic here:

dcor/dcor/distances.py

Lines 56 to 57 in 161a6f5

    
           if exponent != 1: 
        
               distances **= exponent / 2

If, for example, we set exponent=2, which should test for mean differences only (James & Matteson, 2015), it seems that this line would raise the distance to the power of 1, making it equivalent to exponent=1. Is this correct, or am I misunderstanding something?

Thanks!

Add distance skewness and symmetry test

In https://doi.org/10.1016/j.jspi.2013.03.018 (https://pages.stat.wisc.edu/~wahba/stat860public/pdf4/Energy/JSPI5102.pdf) a measure of asymmetry, distance skewness, is described, as well as a test of symmetry using it. We should attempt to implement it in this package.

Implement distance skewness.
Implement symmetry test.

version returns 0.0. Version number is on a separate file

Hello. Thank you for this very useful package. I need to query the version installed and check that it is >=0.5.3.

In dcor/init.py

try:
with open(_os.path.join(_os.path.dirname(file),
'..', 'VERSION'), 'r') as version_file:
version = version_file.read().strip()
except IOError as e:
if e.errno != _errno.ENOENT:
raise

__version__ = "0.0"

You are reading the version from the VERSION file and at the end anyway forcing the version number to be 0.0. This is always returning 0.0 when i do

import dcor
print(dcor.version)

Distance correlation of matrix and vector.

dcor returns a scalar for the distance correlation of a matrix and a vector. I cannot yet understand why this is the case as isn't the distance correlation defined between two vectors and so I would expect a vector of the correlations as the output.

Could you explain what's going on?

Incorrect documentation about arbitrary dimensions

Hello,

The documentation seems to suggest that I can pass n-dim arguments to distance_correlation, however as soon as I pass a 41318, 2, 5 tensor, I get errors. Reading #50 seems to suggest that I need to reshape the input. Reshaping by flattening the inner dimensions fixes the assertion errors, but this leads me to believe that the function does not actually implement n-dim arguments.
Row-wise calculations as suggested imply that I need to unroll the dimensions manually to do the calculations?
In which case, IMHO the documentation isn't really accurate of the methods implemented.

Thanks for the library by the way, much appreciated!

Accelerate the rowwise AVL implementation of distance covariance using the GPU

Numba supports GPU programming, but most numpy functions are unsupported, which makes almost impossible sharing code for the GPU and CPU implementations.

If, however, any charitable soul wants to try to implement an alternative version of rowwise using the AVL implementation, accelerated using the GPU via Numba, it would be very helpful.

Is there a fast way of doing pairwise distance correlation (dcor.distance_correlation)

Hi,

I am trying to do a pairwise distance correlation for every column in a pandas dataframe of shape (1000, 10000) - i want to do a pairwise correlation of all columns (so 10k pairwise correlations, each column by every other column.)

if i run the following code:

dist_corr = lambda column1, column2: dcor.distance_correlation(column1, column2)
d_corr = df.apply(lambda col1: df.apply(lambda col2: dist_corr(col1, col2)))

this takes far too long, many many hours and in some cases doesn't finish. Is there an implementation that is more optimised? any advice would be much appreciated.

thank you

Study and implement energy-based clustering

As mentioned in https://doi.org/10.1016/j.jspi.2013.03.018 (https://pages.stat.wisc.edu/~wahba/stat860public/pdf4/Energy/JSPI5102.pdf), the energy distance can be used to implement a linkage method for hierarchical clustering.

We should study the best way to implement it, if possible in a manner compatible with existing hierarchical clustering methods, such as scipy methods and scikit-learn AgglomerativeClustering class.

Help with understanding the homogeneity test

I have an experiment wherein i have two groups of customers with the same attributes. I wanted to do a multivariate homogeneity test for this and used the dcor.homogeneity.energy_test() method on both the groups. My question is that i always end up with a p value of 1 or close to 1. I simulated a 2 d dataset in two cases a.) There are two distinct clusters seperated b.) The data clusters are overlapping. The p value in both the cases came out to be 1 although the test statistic value was different. I want to understand how the homogeneity test works? Help is much appreciated.

Can dcor with method 'AVL' or 'megresort' is applicable between two data types float and integer, respectively or it always has to be float?

Energy distance using medians?

Hi, thank you for your phenomenal work writing and documenting this library.

As I'm sure you're aware, there has been some literature suggesting that an energy statistic that is more robust to outliers can be calculated by taking the median rather than mean when calculating the average distance between samples. See: James, N. A., Kejariwal, A., & Matteson, D. S. (2016). Leveraging cloud data to mitigate user experience from ‘breaking bad.’ 2016 IEEE International Conference on Big Data (Big Data), 3499–3508. https://doi.org/10.1109/BigData.2016.7841013. Specifically section 3a of that article, "Robustness against Anomalies".

From looking at this library, it seems to me that this change would be as simple as allowing a configurable "average" function which would replace the use of mean in this code:

dcor/dcor/_energy.py

Lines 24 to 28 in e735155

    
           def _energy_distance_from_distance_matrices( 
        
                   distance_xx, distance_yy, distance_xy): 
        
               """Compute energy distance with precalculated distance matrices.""" 
        
               return (2 * np.mean(distance_xy) - np.mean(distance_xx) - 
        
                       np.mean(distance_yy))

Would you be interested in such an implementation?

AttributeError: 'float' object has no attribute 'dtype'

Hi,

I am getting the following error with the latest release but not with version 0.5:

multiprocess.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "D:\Anaconda3\envs\tuneta\lib\site-packages\multiprocess\pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "D:\Projects\tuneta\tuneta\optimize.py", line 228, in fit
    self.study.optimize(
  File "D:\Anaconda3\envs\tuneta\lib\site-packages\optuna\study\study.py", line 400, in optimize
    _optimize(
  File "D:\Anaconda3\envs\tuneta\lib\site-packages\optuna\study\_optimize.py", line 66, in _optimize
    _optimize_sequential(
  File "D:\Anaconda3\envs\tuneta\lib\site-packages\optuna\study\_optimize.py", line 163, in _optimize_sequential
    trial = _run_trial(study, func, catch)
  File "D:\Anaconda3\envs\tuneta\lib\site-packages\optuna\study\_optimize.py", line 264, in _run_trial
    raise func_err
  File "D:\Anaconda3\envs\tuneta\lib\site-packages\optuna\study\_optimize.py", line 213, in _run_trial
    value_or_values = func(trial)
  File "D:\Projects\tuneta\tuneta\optimize.py", line 229, in <lambda>
    lambda trial: _objective(self, trial, X, y),
  File "D:\Projects\tuneta\tuneta\optimize.py", line 180, in _objective
    correlation = distance_correlation(
  File "D:\Projects\tuneta\tuneta\utils.py", line 39, in distance_correlation
    dis = dcor.distance_correlation(a, b)
  File "D:\Anaconda3\envs\tuneta\lib\site-packages\dcor\_dcor.py", line 1049, in distance_correlation
    distance_correlation_sqr(
  File "D:\Anaconda3\envs\tuneta\lib\site-packages\dcor\_dcor.py", line 928, in distance_correlation_sqr
    return method.value.dcor_sqr(
  File "D:\Anaconda3\envs\tuneta\lib\site-packages\dcor\_dcor.py", line 190, in <lambda>
    return lambda *args, **kwargs: self._dispatch(
  File "D:\Anaconda3\envs\tuneta\lib\site-packages\dcor\_dcor.py", line 173, in _dispatch
    return getattr(DistanceCovarianceMethod.AVL.value, method)(
  File "D:\Anaconda3\envs\tuneta\lib\site-packages\dcor\_dcor.py", line 145, in <lambda>
    self.dcor_sqr = lambda *args, **kwargs: self.stats_sqr(
  File "D:\Anaconda3\envs\tuneta\lib\site-packages\dcor\_dcor.py", line 103, in <lambda>
    lambda *args, **kwargs: _distance_stats_sqr_generic(
  File "D:\Anaconda3\envs\tuneta\lib\site-packages\dcor\_dcor.py", line 391, in _distance_stats_sqr_generic
    correlation_xy_sqr = xp.asarray(0, dtype=covariance_xy_sqr.dtype)
AttributeError: 'float' object has no attribute 'dtype'
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "C:\Program Files\JetBrains\PyCharm Community Edition 2021.3\plugins\python-ce\helpers\pydev\pydevd.py", line 1491, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "C:\Program Files\JetBrains\PyCharm Community Edition 2021.3\plugins\python-ce\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "D:/Projects/tuneta/examples/tune_multiple.py", line 18, in <module>
    tt.fit(X_train, y_train,
  File "D:\Projects\tuneta\tuneta\tune_ta.py", line 175, in fit
    self.fitted = [fit.get() for fit in self.fitted]
  File "D:\Projects\tuneta\tuneta\tune_ta.py", line 175, in <listcomp>
    self.fitted = [fit.get() for fit in self.fitted]
  File "D:\Anaconda3\envs\tuneta\lib\site-packages\multiprocess\pool.py", line 771, in get
    raise self._value
AttributeError: 'float' object has no attribute 'dtype'

Improve performance of pairwise distances computation

The computation of pairwise distances is the main bottleneck of the naive algorithm for distance covariance. Currently we use scipy's cdist for Numpy arrays, and a broadcasting computation in other case.

Any performance improvement to this function is thus well received.

error in import dcor

Hello,

I installed the Python dcor package, and I got the following error whenever I tried to import dcor.

Traceback (most recent call last):
File "", line 1, in
File "/Users/sanghoonkim/anaconda3/lib/python3.7/site-packages/dcor/init.py", line 14, in
from . import independence # noqa
File "/Users/sanghoonkim/anaconda3/lib/python3.7/site-packages/dcor/independence.py", line 13, in
from ._dcor import u_distance_correlation_sqr
File "/Users/sanghoonkim/anaconda3/lib/python3.7/site-packages/dcor/_dcor.py", line 27, in
from ._fast_dcov_avl import _distance_covariance_sqr_avl_generic
File "/Users/sanghoonkim/anaconda3/lib/python3.7/site-packages/dcor/_fast_dcov_avl.py", line 89, in
_generate_partial_sum_2d(compiled=True))
File "/Users/sanghoonkim/anaconda3/lib/python3.7/site-packages/numba/decorators.py", line 200, in wrapper
disp.compile(sig)
File "/Users/sanghoonkim/anaconda3/lib/python3.7/site-packages/numba/compiler_lock.py", line 32, in _acquire_compile_lock
return func(*args, **kwargs)
File "/Users/sanghoonkim/anaconda3/lib/python3.7/site-packages/numba/dispatcher.py", line 768, in compile
cres = self._compiler.compile(args, return_type)
File "/Users/sanghoonkim/anaconda3/lib/python3.7/site-packages/numba/dispatcher.py", line 81, in compile
raise retval
File "/Users/sanghoonkim/anaconda3/lib/python3.7/site-packages/numba/dispatcher.py", line 91, in _compile_cached
retval = self._compile_core(args, return_type)
File "/Users/sanghoonkim/anaconda3/lib/python3.7/site-packages/numba/dispatcher.py", line 109, in _compile_core
pipeline_class=self.pipeline_class)
File "/Users/sanghoonkim/anaconda3/lib/python3.7/site-packages/numba/compiler.py", line 551, in compile_extra
return pipeline.compile_extra(func)
File "/Users/sanghoonkim/anaconda3/lib/python3.7/site-packages/numba/compiler.py", line 331, in compile_extra
return self._compile_bytecode()
File "/Users/sanghoonkim/anaconda3/lib/python3.7/site-packages/numba/compiler.py", line 393, in _compile_bytecode
return self._compile_core()
File "/Users/sanghoonkim/anaconda3/lib/python3.7/site-packages/numba/compiler.py", line 373, in _compile_core
raise e
File "/Users/sanghoonkim/anaconda3/lib/python3.7/site-packages/numba/compiler.py", line 364, in _compile_core
pm.run(self.state)
File "/Users/sanghoonkim/anaconda3/lib/python3.7/site-packages/numba/compiler_machinery.py", line 347, in run
raise patched_exception
File "/Users/sanghoonkim/anaconda3/lib/python3.7/site-packages/numba/compiler_machinery.py", line 338, in run
self._runPass(idx, pass_inst, state)
File "/Users/sanghoonkim/anaconda3/lib/python3.7/site-packages/numba/compiler_lock.py", line 32, in _acquire_compile_lock
return func(*args, **kwargs)
File "/Users/sanghoonkim/anaconda3/lib/python3.7/site-packages/numba/compiler_machinery.py", line 302, in _runPass
mutated |= check(pss.run_pass, internal_state)
File "/Users/sanghoonkim/anaconda3/lib/python3.7/site-packages/numba/compiler_machinery.py", line 275, in check
mangled = func(compiler_state)
File "/Users/sanghoonkim/anaconda3/lib/python3.7/site-packages/numba/typed_passes.py", line 95, in run_pass
raise_errors=self._raise_errors)
File "/Users/sanghoonkim/anaconda3/lib/python3.7/site-packages/numba/typed_passes.py", line 66, in type_inference_stage
infer.build_constraint()
File "/Users/sanghoonkim/anaconda3/lib/python3.7/site-packages/numba/typeinfer.py", line 938, in build_constraint
self.constrain_statement(inst)
File "/Users/sanghoonkim/anaconda3/lib/python3.7/site-packages/numba/typeinfer.py", line 1274, in constrain_statement
self.typeof_assign(inst)
File "/Users/sanghoonkim/anaconda3/lib/python3.7/site-packages/numba/typeinfer.py", line 1345, in typeof_assign
self.typeof_global(inst, inst.target, value)
File "/Users/sanghoonkim/anaconda3/lib/python3.7/site-packages/numba/typeinfer.py", line 1444, in typeof_global
typ = self.resolve_value_type(inst, gvar.value)
File "/Users/sanghoonkim/anaconda3/lib/python3.7/site-packages/numba/typeinfer.py", line 1366, in resolve_value_type
raise TypingError(msg, loc=inst.loc)
numba.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Untyped global name '_dyad_update': cannot determine Numba type of <class 'function'>

File "anaconda3/lib/python3.7/site-packages/dcor/_fast_dcov_avl.py", line 70:
def _partial_sum_2d(x, y, c, ix, iy, sx_c, sy_c, c_sum, l_max,

    dyad_update = _dyad_update_compiled if compiled else _dyad_update
    ^

dcor requires Python '>=2.7, >=3.5, <4' but the running Python is 2.7.14

hi, is this weird when i pip install docr but failed

Numba support

I'm trying to use distance correlations as a metric for computing UMAP embeddings. This requires Numba support.

Is there a fundamental reason why dcor.correlation_distance can't support Numba, or is it just a matter of going over the code?

PyPI release?

Hi, any chance of a PyPI release with the last 3 PRs included? Thanks.

Possible more efficient implementation

Review and implement the algorithm in https://doi.org/10.1016/j.csda.2019.01.016, and compare results with the current one.

Process killed due to very large array

Hello,
I am trying to get the distance correlation between two very large vectors (25k each), and the dcor function gets killed due to out of memory error. How can we fix that?

dcor.distance_correlation(np.array(x, dtype=np.float32), np.array(y, dtype=np.float32), exponent=0.5)

Maybe, but does the general code not work in that case?

          Maybe, but does the general code not work in that case?

Originally posted by @vnmabus in #62 (comment)

Add references in the documentation of each method.

Although there are references in the theory and readme docs, the individual methods (such as the energy test) should reference their original publications.

Implement an optimized `rowwise` for the MERGESORT implementation

The MERGESORT implementation of distance covariance could also implement an optimized rowwise method using Numba.

Counting the distance from a point to itself

Hi, I've hit a bit of a problem. I was trying to work out why I was getting different results from the ecp R package versus dcor. After some intense investigation, I think the cause seems to be at the point of taking the mean of each within-sample distance. Note, this is before we apply the coefficient or consider the between-sample distances. Precisely, I'm referring to the mean taken here:

dcor/dcor/_energy.py

Lines 41 to 42 in 161a6f5

    
           average(distance_xx) - 
        
           average(distance_yy)

In all the Székely and Rizzo papers (e.g. Székely & Rizzo, 2004), this mean is defined as the arithmetic mean, and the same as you have used in dcor:

However in the Matteson and James papers I have been looking at (e.g. Matteson & James, 2014; James et al., 2016), they seem to define it as follows:

What they seem to be doing here is summing the lower triangle of the matrix, excluding the diagonal, and then divided by the combination n choose 2. So if we had a sample with 5 items, the full distance matrix would be 5 x 5 = 25 items, but the lower triangle would only have 10 items in it. They would sum these distances and divide by 5 choose 2, which is 10. So this is also taking the mean, but it's the mean excluding the diagonal, which is of course always 0 in a within-sample distance matrix. The ultimate outcome is that their "mean" is actually $\frac{n}{n-1} \mu$ , which is larger than it should be, as it isn't counting the 0s on the diagonal.

Note that this is also visible in the implementation of their work, in the ecp package. Here, they sum the matrix but then divide by $n \times n - 1$ , which is equivalent to the above, but not equivalent to the true mean:
https://github.com/zwenyu/ecp/blob/65a9bb56308d25ce3c6be4d6388137f428118248/src/energyChangePoint.cpp#L112

My question is this: are they simply wrong? If no, is there any theory supporting this alternative formula? If there is, should this be something supported in dcor? Fortunately it kind of already is thanks to my customizable average feature. But it could be called out specifically. I appreciate your input here as you likely understand this domain better than I do.

Add goodness-of-fit tests

Energy distance can be used to perform goodness-of-fit tests, as mentioned in https://doi.org/10.1016/j.jspi.2013.03.018 (https://pages.stat.wisc.edu/~wahba/stat860public/pdf4/Energy/JSPI5102.pdf).

It would be useful to create a new submodule goodness that could include some of the following:

Two-parameter exponential distribution goodness-of-fit test.
Uniform distribution goodness-of-fit test.
Univariate normality goodness-of-fit test.
Multivariate normality goodness-of-fit test.
Pareto distribution goodness-of-fit test.
Poisson distribution goodness-of-fit test.
Uniform distribution goodness-of-fit test.
Stable distributions goodness-of-fit test.

add support for faster dcor

in this paper: https://arxiv.org/pdf/1810.11332.pdf an O(n log n) version of the algorithm is reported. it may be worthwhile to recode into python.

FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\domin\\PycharmProjects\\Trading_Backtesting_ML\\venv\\lib\\site-packages\\dcor\\pycache\\_fast_dcov_mergesort._generate_distance_covariance_sqr_mergesort_generic_impl.locals._distance_covariance_sqr_mergesort_generic_impl-163.py38.nbi.tmp.4ae6be2f415b45ff'

Cannot import dcor as for some reason the pycache folder seems to be missing something

Implement distance components (DISCO)

Distance components (DISCO) is an extension of ANOVA using the energy distance. We should definitely try to implement this method in dcor.

	def _energy_distance_from_distance_matrices(
	distance_xx, distance_yy, distance_xy):
	"""Compute energy distance with precalculated distance matrices."""
	return (2 * np.mean(distance_xy) - np.mean(distance_xx) -
	np.mean(distance_yy))

vnmabus / dcor Goto Github PK

dcor's People

Contributors

Stargazers

Watchers

Forkers

dcor's Issues

Recommend Projects

Recommend Topics

Recommend Org