Determining the optimal number of clusters about kmodes HOT 7 OPEN

nicodv commented on July 25, 2024

Determining the optimal number of clusters

from kmodes.

Comments (7)

PabloVergara commented on July 25, 2024 7

I use the silhouette score for the numerical variables and keep using the cost for all variables,
with a small change here in kprototypes.py

and this piece of code in the implementation:

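import time
from sklearn import metrics
from kmodes.kprototypes import KPrototypes

# `data` is a pandas DataFrame: columns 0-8 are numerical, column 9 is categorical.
lista = []
for nc in range(20, 23):
    start = time.time()
    kp = KPrototypes(n_clusters=nc, init='Cao', n_init=22, verbose=1, random_state=4, n_jobs=8)
    clusters = kp.fit_predict(data.values, categorical=[9])
    end = time.time()
    # Silhouette on the numerical columns only; kp.best comes from the modified kprototypes.py.
    lista.append([nc,
                  'Silhouette Coefficient: %0.3f' % metrics.silhouette_score(data.iloc[:, 0:9], kp.labels_),
                  'cost: %0.3f' % kp.cost_,
                  'time (s): %0.3f' % (end - start),
                  'best run: %d' % (list(kp.best.keys())[0] + 1)])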

With this you can get a partial result.


dexdimas commented on July 25, 2024 1

And how do you determine the optimal k for the k-prototypes?

I am working on clustering data with mixed categorical and numerical attributes. When I stumbled across your k-prototypes implementation, I wanted to apply it to my case. However, I'm a bit confused about how to evaluate the result of the k-prototypes algorithm (e.g. how to determine the optimal k).

But since it was mentioned that a silhouette plot would do the trick, I've been thinking of replacing the Euclidean distance with the k-prototypes cost function to determine the intra- and inter-cluster distances in the silhouette analysis.

Do you think that would work?
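One way to try this idea is sketched below. It is only a minimal sketch under stated assumptions, not part of kmodes: the helper names `mixed_dissimilarity` and `silhouette_from_matrix` are hypothetical, and the dissimilarity is assumed to be the squared Euclidean distance on the numerical columns plus gamma times the number of mismatching categorical columns. The silhouette is then computed from that precomputed pairwise matrix instead of from Euclidean distances.

```python
import numpy as np

def mixed_dissimilarity(X_num, X_cat, gamma):
    """Pairwise dissimilarity (hypothetical helper, not a kmodes function):
    squared Euclidean distance on numerical columns
    plus gamma * number of mismatching categorical columns."""
    X_num = np.asarray(X_num, dtype=float)
    X_cat = np.asarray(X_cat)
    num = ((X_num[:, None, :] - X_num[None, :, :]) ** 2).sum(axis=2)
    cat = (X_cat[:, None, :] != X_cat[None, :, :]).sum(axis=2)
    return num + gamma * cat

def silhouette_from_matrix(D, labels):
    """Mean silhouette coefficient from a precomputed dissimilarity matrix."""
    D = np.asarray(D, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = np.zeros(len(labels))
    for i in range(len(labels)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # singleton cluster: silhouette is taken as 0
        # D[i, i] is 0, so dividing by (n_own - 1) averages over the other members.
        a = D[i, own].sum() / (n_own - 1)
        # Smallest mean dissimilarity to any other cluster.
        b = min(D[i, labels == c].mean() for c in clusters if c != labels[i])
        denom = max(a, b)
        scores[i] = 0.0 if denom == 0 else (b - a) / denom
    return scores.mean()
```

The pairwise matrix needs memory quadratic in the number of rows, so on large data this would have to be run on a random subsample. If scikit-learn is available, `sklearn.metrics.silhouette_score(D, labels, metric="precomputed")` should give the same number from the same matrix.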


nicodv commented on July 25, 2024

Simply by running the clustering for multiple k values, as there currently is no wrapper that does this for you automatically.

It would be nice to combine this with the silhouette plot mentioned here.

PRs are welcome. :)
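Since there is no built-in wrapper, the manual loop over k could look roughly like the sketch below. The helper names `scan_k`, `relative_gains` and `fit_kprototypes` are hypothetical and only illustrate the idea; the only kmodes calls used are the `KPrototypes` constructor, `fit(X, categorical=...)` and the `cost_` attribute.

```python
def scan_k(fit, ks):
    """Call fit(k) for every k and collect (k, cost) pairs.
    `fit` must return a fitted model exposing a `cost_` attribute."""
    results = []
    for k in ks:
        model = fit(k)
        results.append((k, model.cost_))
    return results

def relative_gains(results):
    """Fractional cost decrease of each k relative to the previous k in the scan."""
    gains = []
    for (_, prev), (k, cost) in zip(results, results[1:]):
        gains.append((k, (prev - cost) / prev if prev else 0.0))
    return gains

def fit_kprototypes(X, categorical, k, **kwargs):
    """Fit k-prototypes for one k (hypothetical convenience helper)."""
    # Imported lazily so the scan helpers above work without kmodes installed.
    from kmodes.kprototypes import KPrototypes
    model = KPrototypes(n_clusters=k, **kwargs)
    model.fit(X, categorical=categorical)
    return model
```

With a DataFrame `data` whose column 9 is categorical, as in the snippet earlier in this thread, a scan could be started with `scan_k(lambda k: fit_kprototypes(data.values, [9], k, init='Cao', random_state=4), range(2, 11))`. The cost normally keeps falling as k grows, so the point where the relative gains flatten out is only a rough elbow-style hint, not a definitive answer.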


doyager commented on July 25, 2024

@dexdimas

Hi @dexdimas, @nicodv, all,

I am also working with k-prototypes and trying to find the optimal k. Could you please share your experience or approach for finding the optimal k when using k-prototypes? It would be great if you could share some code and links.

Any suggestions for plotting very high-dimensional data? I am working with 56 features: 35 categorical columns (3 columns have about 10,000 categories and all others have about 10-12 categories), 11 numerical columns and 10 binary columns, with a data size of 80 million records.

PS: I am trying to find patterns and outliers, in particular outliers that would not fit in with the normal clusters. I am using health care data.

Thank you in advance; any help is appreciated.


supreetkt commented on July 25, 2024

Hi @nicodv,

I'm working on an implementation of the silhouette score that uses the dissimilarity (between each element of the array) as the distance metric and gives the optimal number of clusters, k. What other metric would you consider a good basis for the silhouette score calculation?


matiasscorsetti commented on July 25, 2024

Hello,

How do I calculate the silhouette score in k-prototypes if I have a silhouette score for the categorical data (Hamming) and a silhouette score for the numerical data (Euclidean)?
Should I take a weighted average of the two coefficients according to the gamma value?

How would this weighted average be calculated?

It could be done this way:

( silhouette_category * kp.gamma ) + ( silhouette_numeric * (1 - kp.gamma ) )

Thanks


arnaud-nt2i commented on July 25, 2024

@matiasscorsetti
gamma is not in [0, 1] (it is not a proportionality coefficient) but in [0, +inf).

From reading the R implementation of "silhouette_kproto" line 1134 : Rdocumentation
(gamma is called lambda there)

It seems to me they weight the two silhouette values as follows:
( silhouette_category * gamma ) + ( silhouette_numeric )

but I may be wrong...

Any ideas, @nicodv?
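If a true weighted average is wanted, one option is to normalise the unbounded gamma into weights that sum to 1. The sketch below is an assumption for illustration only: it is not taken from the R package or from kmodes, and `combined_silhouette` is a hypothetical name. It divides the weighted sum above by (1 + gamma), which gives the numerical silhouette weight 1 / (1 + gamma) and the categorical silhouette weight gamma / (1 + gamma), so the result stays in [-1, 1].

```python
def combined_silhouette(sil_num, sil_cat, gamma):
    """Weighted average of two silhouette scores with weights
    1 / (1 + gamma) for the numerical part and gamma / (1 + gamma)
    for the categorical part (an illustrative normalisation, gamma >= 0)."""
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    return (sil_num + gamma * sil_cat) / (1.0 + gamma)
```

With gamma = 0 this returns the numerical silhouette unchanged, and as gamma grows the categorical silhouette dominates. Averaging two separately computed silhouettes is not the same as computing a single silhouette on the combined dissimilarity, so the two approaches can disagree.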

