kwstat / hopkins
Hopkins statistic for clustering
License: Other
I'm not sure whether I am doing something wrong, but here is what happens:
I have a data frame in the format shown below, containing features extracted from 15 images of one class (1024 dimensions).
[1] "Number of columns:"
[1] 1024
[1] "Data frame:"
# A tibble: 15 × 1,024
n0 n1 n2 n3 n4 n5 n6 n7 n8 n9
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.160 0.0716 -0.259 0.176 0.517 -0.0688 -0.199 0.0999 0.156 0.384
2 -0.100 -0.111 -0.294 0.305 0.373 -0.227 -0.130 0.553 0.128 0.313
3 0.0758 0.0861 -0.276 0.196 0.595 -0.232 -0.0155 -0.0915 -0.000393 0.333
4 0.164 0.0172 -0.189 0.173 0.354 -0.296 -0.0317 0.0504 -0.0319 0.355
5 0.163 -0.107 -0.330 0.124 0.542 -0.296 -0.141 -0.00439 -0.0609 0.255
6 0.296 0.0430 -0.400 0.186 0.606 -0.0735 -0.186 0.0813 0.206 0.465
7 0.180 -0.0658 -0.266 0.193 0.344 -0.111 0.0569 -0.0170 0.105 0.356
8 0.175 0.0847 -0.329 0.233 0.535 -0.180 -0.121 -0.0474 0.00945 0.400
9 0.143 0.0531 -0.116 0.183 0.615 -0.246 -0.171 0.103 -0.0468 0.294
10 0.163 -0.121 -0.335 0.0410 0.802 -0.342 -0.0733 -0.149 0.0699 0.147
11 0.182 0.122 -0.264 0.239 0.571 -0.0713 -0.170 -0.0525 0.0392 0.313
12 0.290 -0.233 -0.283 0.115 0.508 -0.461 -0.0274 -0.194 -0.0963 0.272
13 0.154 -0.0282 -0.264 0.278 0.540 -0.0221 -0.225 0.141 0.205 0.293
14 0.134 0.132 -0.391 0.229 0.414 -0.172 -0.0504 0.295 0.226 0.277
15 0.107 0.0469 -0.235 0.157 0.590 -0.129 -0.0529 0.160 0.102 0.193
I then tried to run hopkins as shown in the documentation:
hopkins(test, m=2)
This yields NaN when run as above, or the following error when using torus geometry:
Error in numeric(3L^d) : vector size cannot be infinite
Another problem appears when I set the number of samples equal to the number of rows (m = 100%, i.e., 15), which outputs:
m must be no larger than num of samples
even though m is equal to, not greater than, the number of samples.
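The following is a minimal sketch of why d = 1024 can produce both symptoms in double-precision arithmetic. It uses made-up toy distances, not the data above, and it only assumes that the statistic is a ratio of sums of distances raised to the power d (as in the formula variants discussed in this thread); it is not the package's actual code.

```r
# Illustration only: toy numbers, not the reporter's data or the package internals.
d <- 1024                      # number of columns (dimensions), as in the report

# 1) Small distances raised to the d-th power underflow to exactly 0,
#    and large distances overflow to Inf.
small <- c(0.2, 0.3)^d
large <- c(5, 7)^d

# Either way, a ratio of such sums is undefined.
ratio_small <- sum(small) / (sum(small) + sum(small))   # 0 / 0
ratio_large <- sum(large) / (sum(large) + sum(large))   # Inf / Inf
print(c(ratio_small, ratio_large))

# 2) 3^d is not a finite double for d = 1024,
#    so it cannot be used as a vector length.
print(is.finite(3^d))
msg <- tryCatch(numeric(3^d), error = function(e) conditionMessage(e))
print(msg)
```

If this is indeed the cause, reducing the number of columns before calling hopkins (for example by keeping a few principal components) would avoid the overflow, at the cost of testing the reduced data rather than the original features.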
By this definition, is the Hopkins statistic not applicable to extremely high-dimensional data, such as D = 4000+? The result will be either 0 or Not a Number (Inf / Inf).
Or, if this is used:
stat = 1 / (1 + sum(dwx^d) / sum( dux^d ) )
the result will be either 0 or 1. (I'm not a native English speaker, so I did not understand what 'it is not for our test cases' in the annotations means. Why is this formula not used or usable?)
(I'm trying to check whether there is a clustering tendency in a questionnaire with 102 items. Since the items are what is being clustered, the 'dimension' here is the 4000+ subjects who answered the questionnaire.)
Would it be fine to repeat the process, say 10000 times, and count the 1s and 0s?
Or is there another clustering tendency index for this situation?
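To illustrate the 0-or-1 behaviour, here is a rough sketch that evaluates the formula quoted above in log space, so that nothing overflows. The helper names `log_sum_pow` and `hopkins_logspace` and the toy distances are hypothetical and not part of the hopkins package. The sketch suggests that the saturation is a property of the formula itself: for large d, each sum is dominated by its largest distance, so the ratio behaves like (max(dwx) / max(dux))^d and collapses towards 0 or Inf.

```r
# Hypothetical helpers for illustration; not part of the hopkins package.
log_sum_pow <- function(x, d) {
  z <- d * log(x)            # log of each x^d
  m <- max(z)
  m + log(sum(exp(z - m)))   # log-sum-exp, stable for large d
}

hopkins_logspace <- function(dux, dwx, d) {
  lu <- log_sum_pow(dux, d)  # log(sum(dux^d))
  lw <- log_sum_pow(dwx, d)  # log(sum(dwx^d))
  1 / (1 + exp(lw - lu))     # same as 1 / (1 + sum(dwx^d) / sum(dux^d))
}

# Toy nearest-neighbour distances, made up for illustration.
dux <- c(1.00, 1.10, 1.20)
dwx <- c(0.90, 1.00, 1.15)

h_low  <- hopkins_logspace(dux, dwx, d = 2)      # moderate value
h_high <- hopkins_logspace(dux, dwx, d = 4000)   # saturates towards 1
print(c(h_low, h_high))
```

Under these assumptions the value is finite for any d, but at d = 4000 it carries little more information than which set contains the larger maximum distance, so counting 1s and 0s over many repeats would mostly measure that comparison.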