Comments (3)
Can you please clarify what you mean by "can't support Numba"? This package uses Numba to accelerate the internal computations whenever the "fast distance covariance algorithm" can be used instead of the original O(N^2) algorithm.
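For readers unfamiliar with the "original O(N^2) algorithm" mentioned above, here is a minimal NumPy sketch of the textbook (biased) sample distance correlation for two 1-D vectors, returned as the `1 - dcor` distance used later in this thread. It is not dcor's actual implementation: the names `_centered_distances` and `distcor_1d` are hypothetical, and returning 1.0 for a constant input is a convention assumed for this sketch.

```python
import numpy as np


def _centered_distances(x):
    """Double-centered matrix of pairwise absolute differences (O(N^2))."""
    n = x.shape[0]
    d = np.empty((n, n), dtype=np.float64)
    for i in range(n):
        for j in range(n):
            d[i, j] = abs(float(x[i]) - float(x[j]))
    # The matrix is symmetric, so row means equal column means.
    row_means = d.mean(axis=1)
    grand_mean = row_means.mean()
    for i in range(n):
        for j in range(n):
            d[i, j] = d[i, j] - row_means[i] - row_means[j] + grand_mean
    return d


def distcor_1d(x, y):
    """Return 1 - (biased sample distance correlation) of two 1-D arrays."""
    a = _centered_distances(x)
    b = _centered_distances(y)
    dcov2 = (a * b).mean()   # squared distance covariance
    dvar_x = (a * a).mean()  # squared distance variance of x
    dvar_y = (b * b).mean()  # squared distance variance of y
    denom = np.sqrt(dvar_x * dvar_y)
    if denom == 0.0:
        # Assumed convention: a constant vector has correlation 0.
        return 1.0
    return 1.0 - float(np.sqrt(max(dcov2, 0.0) / denom))


x = np.arange(10, dtype=np.float32)
print(distcor_1d(x, 2 * x + 1))  # close to 0 for an exact linear relation
```

The explicit loops are deliberately plain; whether such a function compiles under Numba's nopython mode has not been tested here.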
from dcor.
Hi. First let me apologize for the tone of that question. Besides being vague, it comes off as really rude. I don't know how that came from me. I wanted to know if there was something marginal that I could fix and make everything work.
Second, I can't reproduce the problem for arbitrary data. The problem doesn't happen with iris (which is small, 150 rows x 4 cols), but it does repeat with fashion-mnist (which is larger, on the order of tens of thousands of rows and (28*28) columns -- but not huge either.)
I came to believe that this was a dcor problem because I had used other custom metrics successfully, but, as it turns out, not on datasets as large as fashion-mnist.
I'm including some details on my reproduction attempts, but it's not clearly a dcor problem; it's probably the approximate nearest-neighbors algorithm that UMAP uses.
First, since UMAP uses the idea of nearest neighbors (although it doesn't use the stock/exact algorithm), we try the following (code 1), which works:
from numba import jit
from sklearn.datasets import load_iris
from sklearn.neighbors import kneighbors_graph
from dcor import distance_correlation

iris = load_iris()

@jit
def distcor(x, y):
    return 1 - distance_correlation(x, y)

g = kneighbors_graph(iris.data, 2, mode='distance', metric='pyfunc',
                     metric_params={'func': distcor})
Attempting to run this for fashion-mnist takes more than 20 minutes (the algorithm is expected to blow up on large datasets anyway). I gave up before any errors came up.
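A rough back-of-the-envelope count shows why the exact approach is hopeless here. The sketch below assumes a brute-force search that evaluates every unordered pair of rows once, a naive O(N^2) distance correlation per pair, and 70,000 rows for fashion-mnist (train and test combined); the function names are hypothetical.

```python
from math import comb


def brute_force_pairs(n_rows: int) -> int:
    """Number of unordered row pairs an exhaustive neighbor search compares."""
    return comb(n_rows, 2)


def naive_dcor_cost(n_features: int) -> int:
    """Entries in one pairwise-difference matrix for a vector of this length."""
    return n_features ** 2


iris_pairs = brute_force_pairs(150)        # iris: 150 rows
fmnist_pairs = brute_force_pairs(70_000)   # assumed fashion-mnist row count
per_call = naive_dcor_cost(28 * 28)        # each row has 784 values

print(f"iris pairs:          {iris_pairs:,}")
print(f"fashion-mnist pairs: {fmnist_pairs:,}")
print(f"matrix entries per metric call (784 values): {per_call:,}")
```

Under these assumptions fashion-mnist needs roughly 200,000 times as many metric calls as iris, and each call is also far more expensive because the rows are 784 values long instead of 4.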
The following (code 2) runs UMAP itself, and it works:
from umap import UMAP
embedding = UMAP(metric = distcor, n_neighbors = 4).fit_transform(iris.data)
Code 2 for fashion-mnist fails very loudly.
TypingError: Failed at nopython (nopython frontend)
Invalid usage of type(CPUDispatcher(<function distcor at 0x000001FC83B2B378>)) with parameters (array(float32, 1d, C), array(float32, 1d, C))
* parameterized
[1] During: resolving callee type: type(CPUDispatcher(<function distcor at 0x000001FC83B2B378>))
[2] During: typing of call at C:\Users\Diego Navarro - FGV\Anaconda3b\lib\site-packages\umap\nndescent.py (65)
File "..\..\Anaconda3b\lib\site-packages\umap\nndescent.py", line 65:
    def nn_descent(
        <source elided>
                for j in range(indices.shape[0]):
                    d = dist(data[i], data[indices[j]], *dist_args)
                    ^
This is not usually a problem with Numba itself but instead often caused by
the use of unsupported features or an issue in resolving types.
To see Python/NumPy features supported by the latest release of Numba visit:
http://numba.pydata.org/numba-doc/dev/reference/pysupported.html
and
http://numba.pydata.org/numba-doc/dev/reference/numpysupported.html
For more information about typing errors and how to debug them visit:
http://numba.pydata.org/numba-doc/latest/user/troubleshoot.html#my-code-doesn-t-compile
If you think your code should work with Numba, please report the error message
and traceback, along with a minimal reproducer at:
https://github.com/numba/numba/issues/new
So third, I ran some experiments with the pynndescent library, which appears to be a close cousin of the nndescent module used inside the UMAP library. Surprisingly, this doesn't work even for iris.
from pynndescent import NNDescent
index = NNDescent(iris.data, metric=distcor)
# Query with the same iris data the index was built from.
u, _ = index.query(iris.data, k=1)
Since UMAP and pynndescent share maintainers, I should probably take that up with them.
from dcor.
OK, thank you for the clarification. I did not see your question as rude at all, but I am not a native speaker. When I read your question, I thought that you were asking for GPU or compiled versions of the distance covariance/correlation functions via Numba. This is something I think would be useful for speeding up computations, but I do not have time to implement it right now. However, if I have understood your answer correctly, your problem lay elsewhere. I hope you can find and fix it easily.
from dcor.