Comments (7)
I use the silhouette score for the numerical variables and continue to use the cost for all variables,
with a small change in kprototypes.py
and this piece of code in the implementation:
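import time
from sklearn import metrics
from kmodes.kprototypes import KPrototypes

lista = []
for nc in range(20, 23):
    start = time.time()
    kp = KPrototypes(n_clusters=nc, init='Cao', n_init=22, verbose=1, random_state=4, n_jobs=8)
    clusters = kp.fit_predict(data.values, categorical=[9])
    end = time.time()
    # Silhouette on the numerical columns only (0-8); column 9 is categorical.
    # kp.best is presumably added by the modified kprototypes.py mentioned above.
    lista.append([nc,
                  'Silhouette Coefficient: %0.3f' % metrics.silhouette_score(data.iloc[:, 0:9], kp.labels_),
                  'cost: %0.3f' % kp.cost_,
                  'time (s): %0.3f' % (end - start),
                  'best run: %0.3f' % (list(kp.best.keys())[0] + 1)])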
from kmodes.
And how do you determine the optimal k for k-prototypes?
I am clustering data with mixed categorical and numerical attributes. When I came across your k-prototypes implementation, I wanted to apply it to my case. However, I'm a bit confused about how to evaluate the result of the k-prototypes algorithm (e.g. how to determine the optimal k).
Since it was mentioned that a silhouette plot would do the trick, I've been thinking of replacing the Euclidean distance with the k-prototypes cost function to compute the intra- and inter-cluster distances in the silhouette analysis.
Do you think that would work?
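
A minimal sketch of that idea, not an official kmodes feature: it computes the mean silhouette using a k-prototypes-style dissimilarity (squared Euclidean on the numeric columns plus gamma times the number of categorical mismatches). The function names `mixed_dissimilarity` and `silhouette_mixed` are hypothetical, and using the squared numeric term inside a silhouette is an assumption carried over from the cost function.

```python
def mixed_dissimilarity(a, b, categorical, gamma):
    """Squared Euclidean on numeric columns + gamma * categorical mismatches."""
    cat = set(categorical)
    numeric = sum((a[i] - b[i]) ** 2 for i in range(len(a)) if i not in cat)
    mismatches = sum(1 for i in cat if a[i] != b[i])
    return numeric + gamma * mismatches


def silhouette_mixed(rows, labels, categorical, gamma):
    """Mean silhouette over all rows, using the mixed dissimilarity above."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    def mean_dist(i, members):
        return sum(mixed_dissimilarity(rows[i], rows[j], categorical, gamma)
                   for j in members) / len(members)

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            # Convention: a singleton cluster contributes a score of 0.
            scores.append(0.0)
            continue
        a = mean_dist(i, own)  # mean distance to the point's own cluster
        b = min(mean_dist(i, members)  # mean distance to the nearest other cluster
                for other, members in clusters.items() if other != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy usage: one numeric column (index 0) and one categorical column (index 1).
rows = [(0.0, "a"), (0.1, "a"), (5.0, "b"), (5.1, "b")]
good = silhouette_mixed(rows, [0, 0, 1, 1], categorical=[1], gamma=1.0)
bad = silhouette_mixed(rows, [0, 1, 0, 1], categorical=[1], gamma=1.0)
```

This pairwise version is quadratic in the number of rows, so it is only practical on a sample. With scikit-learn, the same pairwise values could instead be placed in a matrix and passed to `silhouette_score` with `metric='precomputed'`.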
from kmodes.
Simply by running the clustering for multiple k values, as there currently is no wrapper that does this for you automatically.
It would be nice to combine this with the silhouette plot mentioned here.
PRs are welcome. :)
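
A rough sketch of such a manual scan over k, since no wrapper exists: the helpers `scan_k` and `relative_cost_drops` are hypothetical names, and `fit_for_k` is assumed to be any callable that returns `(labels, cost)` for a given k. The commented lines show how it might be wired to `KPrototypes`, using the `fit_predict(..., categorical=...)` call and `cost_` attribute seen earlier in this thread.

```python
def scan_k(fit_for_k, k_values):
    """Run the clustering once per k and collect the labels and cost."""
    results = []
    for k in k_values:
        labels, cost = fit_for_k(k)
        results.append({"k": k, "cost": cost, "labels": labels})
    return results


def relative_cost_drops(results):
    """Fractional cost decrease between consecutive k values.

    A large drop followed by small ones suggests an elbow.
    """
    drops = []
    for prev, cur in zip(results, results[1:]):
        if prev["cost"] > 0:
            drops.append((cur["k"], (prev["cost"] - cur["cost"]) / prev["cost"]))
        else:
            drops.append((cur["k"], 0.0))
    return drops


# Possible wiring to kmodes (assumes `X` is a NumPy array whose column 9 is categorical):
# from kmodes.kprototypes import KPrototypes
# def fit_for_k(k):
#     kp = KPrototypes(n_clusters=k, init='Cao', random_state=4)
#     labels = kp.fit_predict(X, categorical=[9])
#     return labels, kp.cost_

# Stand-in with made-up costs, only to show the shape of the output.
fake_costs = {2: 100.0, 3: 50.0, 4: 45.0}
results = scan_k(lambda k: ([0] * 4, fake_costs[k]), [2, 3, 4])
drops = relative_cost_drops(results)
```

The cost always decreases as k grows, so the drops are only a guide to where the improvement levels off; a silhouette value per k can be stored in the same result dictionaries.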
from kmodes.
I am also working with k-prototypes and trying to find the optimal k value. Could you please share your experience or approach for finding the optimal k when using k-prototypes? It would be great if you could share some code and links.
Any suggestions for plotting very high-dimensional data? I am working with 56 features: 35 categorical columns (3 columns have about 10,000 categories and all the others have about 10-12 categories), 11 numerical columns and 10 binary columns, with a data size of 80 million records.
PS: I am trying to find patterns and outliers, in particular outliers that would not fit in with the normal clusters. I am using health care data.
Thank you in advance; any help is appreciated.
from kmodes.
Hi @nicodv,
I'm working on an implementation of the silhouette score that uses the dissimilarity between each pair of elements of the array as the distance metric and returns the optimal number of clusters, k. What other metric would you consider a good basis for the silhouette score calculation?
from kmodes.
Hello,
How do I calculate the silhouette score for k-prototypes if I have one silhouette score for the categorical data (Hamming) and one for the numerical data (Euclidean)?
Should I take a weighted average of the two coefficients according to the gamma value?
How would this weighted average be calculated?
Could it be done this way?
( silhouette_category * kp.gamma ) + ( silhouette_numeric * (1 - kp.gamma ) )
Thanks
from kmodes.
@matiasscorsetti
gamma is not in [0, 1] (it is not a proportion) but in [0, +inf).
From reading the R implementation of "silhouette_kproto", line 1134: Rdocumentation
(gamma is called lambda there)
It seems to me that they weight the two silhouette values as follows:
( silhouette_category * gamma ) + ( silhouette_numeric )
but I may be wrong...
Any ideas, @nicodv?
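
To make the difference between the two weightings concrete, here is a small sketch. Both function names are hypothetical, and neither is confirmed to be what the R package or kmodes does. The second variant divides by (1 + gamma), which is an assumption added here so that the result stays in [-1, 1]; it is equivalent to a weighted average with weight gamma / (1 + gamma) on the categorical score.

```python
def combine_unnormalised(silhouette_numeric, silhouette_category, gamma):
    """The weighting suggested just above; can leave [-1, 1] when gamma is large."""
    return silhouette_numeric + gamma * silhouette_category


def combine_normalised(silhouette_numeric, silhouette_category, gamma):
    """Same weighting divided by (1 + gamma), so the result stays in [-1, 1]."""
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    return (silhouette_numeric + gamma * silhouette_category) / (1.0 + gamma)


# With gamma = 3 and both partial scores equal to 0.5:
raw = combine_unnormalised(0.5, 0.5, 3.0)   # 2.0, outside the silhouette range
scaled = combine_normalised(0.5, 0.5, 3.0)  # 0.5
```

Note that the silhouette is a ratio of distances, so any weighted sum of two separate scores will in general differ from a single silhouette computed on the combined distance (numeric part plus gamma times the categorical part).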
from kmodes.