ssarfraz / finch-clustering

Source Code for FINCH Clustering Algorithm

License: Other

Python 10.30% MATLAB 16.63% Shell 0.12% Jupyter Notebook 72.94%

finch-clustering's Issues

The performance of FINCH on Aggregation

The code runs without errors, but the NMI I get (computed with sklearn.metrics) is always 0.96536.
The code I run is as follows:

import numpy as np
import scipy.io as sio
from sklearn.metrics import normalized_mutual_info_score as nmi
from finch import FINCH

data = sio.loadmat("Agg.mat")
X = data["X"]
y_true = data["Y"].ravel()  # loadmat returns an (N, 1) array; NMI expects 1-D labels
c_true = len(np.unique(y_true))

Y, num_clu, req_y = FINCH(X, req_clust=c_true, distance='euclidean') # or cosine
acc = nmi(y_true, req_y, average_method="max")
print(acc)

Looking forward to your reply

array is empty with s1 dataset

Why does the algorithm trigger an error on the s1 dataset from http://cs.joensuu.fi/sipu/datasets/ ?

~/finchcls.py in update_adj(self, adj, d)
94 v = np.argsort(d[idx])
95 v = v[:2]
---> 96 x = [idx[0][v[0]], idx[0][v[1]]]
97 y = [idx[1][v[0]], idx[1][v[1]]]
98 a = sp.lil_matrix(adj.get_shape())

IndexError: index 0 is out of bounds for axis 0 with size 0

The same error occurs with the a1 dataset and the "unbalance" dataset.

On all other datasets it works fine.

Code for TW-FINCH

It would be great if you could publish your code for TW-FINCH, since it is a bit hard to replicate the results from the paper.

about output

Thank you for open-sourcing the code. I would like to know what the output 'C' means. It is an N*2 array; which column is the cluster label? I get poor results on my data and want to understand why.

input precomputed distance matrix instead of data

Hi, thanks for your great work. How can I pass a precomputed distance matrix instead of the data? Could you please release a version that supports this?
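One possible workaround, sketched under assumptions: the `FINCH` function documents an optional `initial_rank` argument (the first-neighbor index of every sample), so the first neighbors can be derived from a precomputed distance matrix with plain NumPy. The helper name `first_neighbors_from_distances` is hypothetical, and it is an assumption that the later levels of the hierarchy are still computed from `data`, in which case this only covers the first partition.

```python
import numpy as np

def first_neighbors_from_distances(dist):
    """Return, for each sample, the index of its nearest other sample."""
    d = np.array(dist, dtype=float)      # copy so the caller's matrix is untouched
    if d.ndim != 2 or d.shape[0] != d.shape[1]:
        raise ValueError("dist must be a square N x N matrix")
    np.fill_diagonal(d, np.inf)          # a sample must not pick itself
    return np.argmin(d, axis=1)

# Toy distance matrix for 4 points on a line at 0, 1, 5 and 6.
points = np.array([0.0, 1.0, 5.0, 6.0])
dist = np.abs(points[:, None] - points[None, :])
initial_rank = first_neighbors_from_distances(dist)
print(initial_rank)
```

The resulting vector could then be tried as `FINCH(data, initial_rank=initial_rank)`; whether that is enough for a given use case depends on how the installed version uses `data` afterwards.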

It is impressive that this unsupervised clustering method outperforms other paradigms on five challenging action segmentation datasets. However, some details puzzle me, in particular how the obtained segments are mapped to the action labels (including background) using the Hungarian algorithm. I would greatly appreciate an explanation of these points.
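For readers with the same question, here is a minimal sketch of how such a mapping is commonly done. It is not the authors' evaluation script: the function name `hungarian_map` and the toy labels are made up for illustration, and background is simply treated as one more ground-truth label. Each cluster is matched one-to-one to the label it overlaps most, using SciPy's `linear_sum_assignment`, and the mean over frames (MoF) is computed after mapping.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_map(pred, gt):
    """Map each predicted cluster id to a ground-truth label, one-to-one."""
    pred, gt = np.asarray(pred), np.asarray(gt)
    pred_ids, gt_ids = np.unique(pred), np.unique(gt)
    # overlap[i, j] = number of frames in cluster i that carry label j
    overlap = np.zeros((len(pred_ids), len(gt_ids)), dtype=int)
    for i, p in enumerate(pred_ids):
        for j, g in enumerate(gt_ids):
            overlap[i, j] = np.sum((pred == p) & (gt == g))
    # Maximize the total number of agreeing frames.
    rows, cols = linear_sum_assignment(overlap, maximize=True)
    return {int(pred_ids[r]): int(gt_ids[c]) for r, c in zip(rows, cols)}

pred = [0, 0, 0, 1, 1, 2, 2, 2]      # cluster ids from the clustering
gt = [5, 5, 7, 7, 7, 9, 9, 9]        # frame-wise ground-truth labels
mapping = hungarian_map(pred, gt)
mapped = np.array([mapping[p] for p in pred])
mof = float(np.mean(mapped == np.array(gt)))   # fraction of correctly labeled frames
print(mapping, mof)
```

The sketch assumes the number of clusters equals the number of ground-truth labels in the video; with more clusters than labels, some clusters stay unmatched and would need separate handling.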

Replace `sklearn` with `scikit-learn` in `setup.py`

The former is deprecated and pip throws a hissy fit

great work, waiting for the python code

Finch Algo 2

Thank you for the great method and code.
As far as I understand, Algorithm 2 is needed for evaluation, but I could not find the corresponding Python code.

Error when running TW_FINCH and specifying the number of clusters

Hello,
Thank you for publishing your excellent work.

I was testing the TW_FINCH for clustering and it has been working well, but when I tried to specify the exact number of clusters I wanted, I got the following error:

    [186]  ind = [i for i, v in enumerate(num_clust) if v >= req_clust]
--> [187]  req_c = req_numclust(c[:, ind[-1]], data, req_clust, distance, use_tw_finch=tw_finch)
    [188]else:
    [189]  req_c = c[:, num_clust.index(req_clust)]

IndexError: list index out of range

It seems to be in the c[:,ind[-1]] call.

What could be the reason behind this error?

Thank you.

Could you please provide the tool for visualizing the Figure 2?

Dear @ssarfraz ,
I am sorry to disturb you, but could you please describe in more detail the tool or source code you used to create the visualization in Figure 2 of your paper? Thank you so much!

About Output

Is there any randomness in the clustering results?

Hi, I fixed the random seed and the input data and then applied FINCH for clustering, but the results differ from run to run. What should I do to get the same result every time?

P.S. I have a large amount of data (hundreds of thousands of samples), so the NNDescent method from 'pynndescent' is used. Could this be the cause? What can I do?

Looking forward to your reply, thank you very much

errors when run the run_on_dataset.m

Hello, thank you for posting the TW-FINCH code, and great work! I have tried to reproduce the results, but I ran into a problem when running run_on_dataset.m.

I downloaded the data and put it under E:\FINCH-Clustering-master\TW-FINCH,
then I ran the following:
tw_finch = true
Result = run_on_dataset('50Salads', tw_finch, 'E:\FINCH-Clustering-master\TW-FINCH\Action_Segmentation_Datasets');
The error is as follows:

Elements of the adjacency matrix may be greater than 1

Thanks for this amazing and practical algorithm.
While browsing the Python version of the code, I found that elements of the adjacency matrix may be greater than 1, as shown below.

csr_matrix in python/finch.py, line 45
0, 0, 0, 0, 1
0, 0, 0, 0, 1
0, 1, 0, 0, 0
0, 0, 0, 0, 1
0, 1, 0, 0, 0
adjacency matrix at line 50
0, 1, 0, 1, 1
1, 0, 1, 1, 2
0, 1, 0, 0, 1
1, 1, 0, 0, 1
1, 2, 1, 1, 0

This may affect the value of min_sim in the hierarchical clustering at line 155.
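The matrix quoted above can be reproduced with a few lines of NumPy, assuming the adjacency is built as (A + I)(A + I)^T with the diagonal zeroed. That construction is an assumption inferred from the numbers in this issue, not a statement about the exact library code; `first_nn` is read off the first matrix.

```python
import numpy as np

# First-neighbor index of each of the 5 samples, read off the csr_matrix above:
# row i has its single 1 in column first_nn[i].
first_nn = np.array([4, 4, 1, 4, 1])
n = len(first_nn)

A = np.zeros((n, n), dtype=int)
A[np.arange(n), first_nn] = 1        # directed "i -> first neighbor of i" links

B = A + np.eye(n, dtype=int)         # add self links
adj = B @ B.T                        # adj[i, j] counts link targets shared by i and j
np.fill_diagonal(adj, 0)
print(adj)
```

Under this construction each row of `B` has exactly two nonzeros (the sample itself and its first neighbor), so an entry equals 2 only when two samples are each other's first neighbor, as with samples 1 and 4 here. Every other linked pair gets 1.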

Segmentation fault for large dataset of 5M datapoints of 1024 dimensions

The segmentation fault occurs at the call to the NNDescent function, where the RP-trees are being built and the descent steps are about to start. I am using the H-NNE (koulakis/h-nne#17) algorithm, which uses FINCH under the hood.

Why does an edge between mutual nearest neighbors get a greater weight when compared with min_sim?

Thanks for your work! I like it very much.
I want to know why an edge whose two vertices are each other's most similar node gets a greater weight when compared with min_sim.
In the source code, the weight of such an edge is 2, while all other edges have weight 1, when compared with min_sim.

features for Hollywood and MPII Cooking 2

Hi~ Thanks for releasing your code and great work! I would appreciate if you can help me with the features of these two datasets.

`sklearn` is still a dependency in `setup.py`

Commit b508b1a intended to remove sklearn dependency, but actually removed scipy. You can check the commit's diff here.

scipy is still installed since it's a dependency of scikit-learn, but we also get the deprecated sklearn package.

This means that the problem from #29 still affects finch-clust==0.1.8. We can check it by doing the following (based on How to test whether a package will be affected by the sklearn deprecation):

SKLEARN_ALLOW_DEPRECATED_SKLEARN_PACKAGE_INSTALL=False \
    pip install finch-clust==0.1.8

IndexError when req_clust > num_clust

I call finch using
cluster_partition, n_part_clust, part_labels = FINCH(data, req_clust=2)

and receive this error

line 185, in FINCH
    req_c = req_numclust(c[:, ind[-1]], data, req_clust, distance, use_ann_above_samples, verbose)
IndexError: list index out of range

My best guess is that there is only one cluster, so the condition v >= req_clust is never fulfilled in ind = [i for i, v in enumerate(num_clust) if v >= req_clust]. The index list is therefore empty, and ind[-1] is out of range.

What is the implication and how to best deal with this?
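The failure mode described above can be illustrated in isolation. The helper `pick_partition` below is hypothetical (it is not part of the library) and only wraps the list comprehension quoted in the issue with a guard for the empty case.

```python
def pick_partition(num_clust, req_clust):
    """Return the index of the partition to refine, or None if none is large enough.

    num_clust lists the cluster count of each partition in the hierarchy,
    from finest to coarsest.
    """
    ind = [i for i, v in enumerate(num_clust) if v >= req_clust]
    return ind[-1] if ind else None

print(pick_partition([120, 31, 8, 2], 5))   # coarsest partition with at least 5 clusters
print(pick_partition([1], 2))               # no partition has 2 or more clusters
```

A caller could check the returned `num_clust` list before requesting a cluster count, and fall back to the finest available partition when every entry is smaller than `req_clust`.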

Unable to replicate numbers

Hey,

I was trying to replicate the numbers presented in the paper with the features provided, and my numbers are somewhat lower. Without changing anything, I ran the Python version of the code. On Breakfast I get an MoF of 60.1, whereas the reported number is 62.7. Similarly, on MPII I get 41.51, but the reported number is 42.0 (a very minor difference). Is there a reason for this discrepancy?

The code for TW-FINCH is not available

Different Clustering results when using python and matlab implementation

I noticed that the Python implementation yields different results than the MATLAB version.

I found this by comparing a Python evaluation of the TW-FINCH clustering results against the provided MATLAB evaluation, using one of the provided datasets and the features from the TW-FINCH paper.
After looking into the issue a little more, I found that already in the first steps of the clustering process the two versions assign the same features/frames to different clusters, and the number of clusters is drastically different too, which explains the performance differences.

Have you encountered this issue and if yes are there solutions?

how to handle dataset that is too large to be loaded in the memory?

Suppose I have a dataset of 30 million items (3000w), where each item is a 2048-d vector.
Thanks

How to get the midpoint hit criterion for the MPII?

All I know is how to compute precision and recall. I don't know how to compute the midpoint hit criterion, and the two are different.
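As a rough illustration only, the sketch below assumes the usual reading of the midpoint hit rule: a detection counts as correct if its temporal midpoint falls inside a ground-truth segment of the same class, and a second detection on an already matched segment counts as a false positive. The function name `midpoint_hits` and the toy segments are made up, and the official MPII evaluation may differ in details.

```python
def midpoint_hits(detections, ground_truth):
    """Count true/false positives and false negatives under a midpoint hit rule.

    Both arguments are lists of (start, end, label) with inclusive frame indices.
    """
    matched = set()          # indices of ground-truth segments already hit
    tp = fp = 0
    for start, end, label in detections:
        mid = (start + end) / 2.0
        hit = None
        for k, (gs, ge, gl) in enumerate(ground_truth):
            if gl == label and gs <= mid <= ge and k not in matched:
                hit = k
                break
        if hit is None:
            fp += 1          # wrong place, wrong label, or a repeated hit
        else:
            matched.add(hit)
            tp += 1
    fn = len(ground_truth) - len(matched)
    return tp, fp, fn

gt = [(0, 9, "cut"), (10, 19, "stir")]
det = [(2, 11, "cut"), (8, 9, "cut"), (12, 30, "stir")]
tp, fp, fn = midpoint_hits(det, gt)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(tp, fp, fn, precision, recall)
```

Precision and recall are then computed from these counts, which is why they differ from frame-wise precision and recall.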

ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 4 dimension(s) and the array at index 1 has 2 dimension(s)

Hi, I converted my video into a NumPy array using the method shown here (https://stackoverflow.com/questions/67644826/how-to-convert-a-video-to-a-numpy-array). When I pass it as input via FINCH(data, req_clust=K, tw_finch=True), I get:
ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 4 dimension(s) and the array at index 1 has 2 dimension(s). The shape of my data is currently (928, 108, 108, 3).

How do I fix this? Is there another way to get a feature vector for a video? I would really appreciate a response!
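One likely cause, stated as an assumption: the clustering code expects a 2-D array with one row per sample (here, one row per frame), while the decoded video is 4-D. A minimal sketch of flattening each frame into a single feature vector:

```python
import numpy as np

# Stand-in for a decoded video: 928 frames of 108 x 108 RGB pixels.
video = np.zeros((928, 108, 108, 3), dtype=np.uint8)

# One row per frame, one column per pixel value: shape (N, D).
data = video.reshape(video.shape[0], -1)
print(data.shape)   # (928, 34992)
```

This only fixes the shape. Raw pixel values are usually weak features for action segmentation, so frame-wise descriptors from a feature extractor would normally be used in place of the flattened pixels.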

TWFinch code missing FS "Eval" option; unable to reproduce accuracy

Hello, thank you for posting the code and data for the TWFinch paper.

The code seems to be missing an option to run the FS "Eval" dataset.
I've made a logical change to your code (below) to load this dataset, but am unable to reproduce the accuracy in the paper, which was reported as MoF= 71.1%.

The following change to TW-FINCH/util_fns/read_video.m produces an accuracy of MoF:= 66.7%:

 elseif strcmp(Dataset, 'FS')
    map=readtable(fullfile(mapping_path, 'mappingeval.txt'));
    map2=table([1:numel(map.Var2)]', 'RowNames', map.Var2);
    gt_label_str=table2cell(readtable(fullfile(gt_path, vid_name), 'Delimiter', '#', 'ReadVariableNames',false));
    gt_label_frame=table2array(map2(gt_label_str,1));

I would appreciate any guidance on what might be wrong. Thank you.

Any random factors in this algorithm?

Got different results for different trials in my experiments ...

There is a bug when using pynndescent.NNDescent

Nice work!

When my dataset is very large, the Python code uses "NNDescent" from the "pynndescent" library, and then an error occurs.

My "pynndescent" version is '0.5.5'. How can I fix it?

Looking forward to your reply, thanks.

TW-FINCH feature extraction method

Hello, thanks for your work! I'm sorry, but this question has been bothering me for a long time. For TW-FINCH, can the frame-wise features only be extracted with iDT (as mentioned in your paper), or can they also be extracted with CNN-based methods such as I3D? Will the choice of method affect the clustering results?