Comments (7)

PabloVergara commented on July 25, 2024

I use the silhouette coefficient for the numerical variables and keep using the cost for all variables, with a small change in kprototypes.py:
[Screenshot "Captura4": the change to kprototypes.py]

and this piece of code in the implementation:

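import time
from sklearn import metrics
from kmodes.kprototypes import KPrototypes

lista = []
for nc in range(20, 23):
    start = time.time()
    kp = KPrototypes(n_clusters=nc, init='Cao', n_init=22, verbose=1, random_state=4, n_jobs=8)
    clusters = kp.fit_predict(data.values, categorical=[9])
    end = time.time()
    # Silhouette uses only the numerical columns (0-8); cost_ covers all columns.
    # kp.best is not in stock kmodes; it presumably comes from the kprototypes.py change above.
    lista.append([nc,
                  'Silhouette Coefficient: %0.3f' % metrics.silhouette_score(data.iloc[:, 0:9], kp.labels_),
                  'cost: %0.3f' % kp.cost_,
                  'time (s): %0.3f' % (end - start),
                  'best run: %d' % (list(kp.best.keys())[0] + 1)])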

This gives a partial result: a silhouette coefficient for the numerical variables only, alongside the cost for all variables.

from kmodes.

dexdimas commented on July 25, 2024

And how do you determine the optimal k for k-prototypes?

I am working on clustering data with mixed categorical and numerical attributes. When I came across your k-prototypes implementation, I wanted to use it for my case. However, I'm a bit confused about how to evaluate the result of the k-prototypes algorithm (e.g. how to determine the optimal k).

Since it was mentioned that a silhouette plot would do the trick, I've been thinking of replacing the Euclidean distance with the k-prototypes cost function to determine the intra- and inter-cluster distances in the silhouette analysis.

Do you think that would work?
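
One way to try this idea is sketched below. It is a minimal, unoptimised illustration and not part of kmodes: the function names and the toy rows are made up for the example, and it assumes the k-prototypes dissimilarity is the squared Euclidean distance on the numerical fields plus gamma times the number of mismatching categorical fields. It is O(n^2) in pure Python, so it is only practical on small data or on a sample.

```python
from collections import defaultdict


def mixed_dissimilarity(a, b, num_idx, cat_idx, gamma):
    """Squared Euclidean distance on numerical fields plus gamma times the
    number of mismatching categorical fields (assumed k-prototypes measure)."""
    numeric = sum((a[i] - b[i]) ** 2 for i in num_idx)
    categorical = sum(a[i] != b[i] for i in cat_idx)
    return numeric + gamma * categorical


def silhouette_mixed(rows, labels, num_idx, cat_idx, gamma):
    """Mean silhouette coefficient under the mixed dissimilarity."""
    members = defaultdict(list)
    for pos, label in enumerate(labels):
        members[label].append(pos)
    if len(members) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for pos, label in enumerate(labels):
        if len(members[label]) == 1:
            continue  # singleton clusters contribute 0 by convention
        # Mean dissimilarity from this row to every cluster (excluding itself).
        mean_to = {}
        for other_label, idxs in members.items():
            dists = [
                mixed_dissimilarity(rows[pos], rows[j], num_idx, cat_idx, gamma)
                for j in idxs
                if j != pos
            ]
            mean_to[other_label] = sum(dists) / len(dists)
        a = mean_to[label]                                      # own cluster
        b = min(v for k, v in mean_to.items() if k != label)    # nearest other
        denom = max(a, b)
        total += (b - a) / denom if denom > 0 else 0.0
    return total / len(labels)


# Toy data: two numerical columns and one categorical column (hypothetical).
rows = [(1.0, 2.0, "a"), (1.2, 1.9, "a"), (8.0, 9.0, "b"), (8.1, 9.2, "b")]
score = silhouette_mixed(rows, [0, 0, 1, 1], num_idx=[0, 1], cat_idx=[2], gamma=0.5)
print(round(score, 3))
```

If a full pairwise matrix of this dissimilarity is available, scikit-learn's silhouette_score can also consume it with metric='precomputed', which avoids reimplementing the averaging.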


nicodv commented on July 25, 2024

Simply by running the clustering for multiple k values, as there currently is no wrapper that does this for you automatically.

It would be nice to combine this with the silhouette plot mentioned here

PRs are welcomed. :)
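
Until such a wrapper exists, the loop over k can be written by hand. The sketch below is only an illustration: scan_k, cluster_fn and score_fn are hypothetical names, not kmodes API. The caller supplies the clustering and the scoring function (for example, a silhouette coefficient), and the helper returns the k with the highest score.

```python
def scan_k(cluster_fn, score_fn, k_values):
    """Run cluster_fn(k) for each k and score the resulting labels.

    cluster_fn(k) must return one cluster label per row.
    score_fn(labels) must return a number where higher is better.
    Both are hypothetical callables supplied by the caller.
    """
    results = []
    for k in k_values:
        labels = cluster_fn(k)
        results.append((k, score_fn(labels)))
    best_k = max(results, key=lambda item: item[1])[0]
    return best_k, results


# With kmodes, cluster_fn could look like this (assumes `data` is a DataFrame
# whose column 9 is categorical, as in the snippet earlier in this thread):
#
#   from kmodes.kprototypes import KPrototypes
#   cluster_fn = lambda k: KPrototypes(n_clusters=k, init='Cao').fit_predict(
#       data.values, categorical=[9])
#   best_k, results = scan_k(cluster_fn, my_score_fn, range(2, 11))
```

Keeping the clustering and the score as separate callables makes it easy to swap the score (cost, silhouette on numerical columns, or a mixed silhouette) without touching the loop.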


doyager commented on July 25, 2024

Hi @dexdimas, @nicodv, all,

I am also working with k-prototypes and trying to find the optimal k. Could you please share your experience or approach for finding the optimal k with k-prototypes? It would be great if you could share some code and links.

Any suggestions for plotting very high-dimensional data? I am working with 56 features: 35 categorical columns (3 have about 10,000 categories each and the others have about 10-12 categories), 11 numerical columns, and 10 binary columns, with a data size of 80 million records.

P.S. I am trying to find patterns and outliers, in particular outliers that would not fit into the normal clusters. I am using health care data.

Thank you in advance; any help is appreciated.


supreetkt commented on July 25, 2024

Hi @nicodv,

I'm working on an implementation of silhouette score, which uses dissimilarity (between each element of the array) as a distance metric and gives the optimal number of clusters, k. What other metric would you consider as a good basis for silhouette score calculation?


matiasscorsetti commented on July 25, 2024

Hello,

How should the silhouette score be calculated for k-prototypes if I have one silhouette score for the categorical data (Hamming) and one for the numerical data (Euclidean)?
Should I take a weighted average of the two coefficients according to the gamma value?

How would this weighted average be calculated?

It could be done this way:

( silhouette_category * kp.gamma ) + ( silhouette_numeric * (1 - kp.gamma ) )

Thanks


arnaud-nt2i commented on July 25, 2024

@matiasscorsetti
gamma is not in [0, 1] (a proportionality coefficient) but in [0, +inf).

From reading the R implementation of "silhouette_kproto", line 1134: Rdocumentation
(gamma is called lambda there)

It seems to me they weight the two silhouette values as follows:
( silhouette_category * gamma ) + ( silhouette_numeric )

but I may be wrong...

Any idea, @nicodv?
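
The two weightings discussed above can be compared with a small helper. This is a sketch, not a confirmed reading of either the R or the Python implementation: combined_silhouette is a hypothetical name, the unnormalised form is the one quoted above, and the optional division by (1 + gamma) is an added assumption that turns it into a weighted average with weights 1/(1 + gamma) and gamma/(1 + gamma), so the result stays in [-1, 1] for any gamma >= 0.

```python
def combined_silhouette(silhouette_numeric, silhouette_category, gamma, normalise=False):
    """Combine per-type silhouette values with weight gamma on the categorical part.

    normalise=False: silhouette_numeric + gamma * silhouette_category,
                     which lies in [-(1 + gamma), 1 + gamma].
    normalise=True:  the same sum divided by (1 + gamma) (an assumption made
                     here, not taken from any library), which lies in [-1, 1].
    """
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    combined = silhouette_numeric + gamma * silhouette_category
    if normalise:
        combined /= 1.0 + gamma
    return combined


# Hypothetical per-type scores, for illustration only.
print(combined_silhouette(0.6, 0.2, gamma=0.5))
print(combined_silhouette(0.6, 0.2, gamma=0.5, normalise=True))
```

The normalised variant is easier to compare across runs with different gamma values, since the unnormalised sum grows with gamma even when the clustering quality does not change.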
