
How to use k-medoids? (clustering.jl, CLOSED)

juliastats commented on August 23, 2024
How to use k-medoids?

from clustering.jl.

Comments (9)

kingzbauer commented on August 23, 2024

This might be related. I keep getting this error when trying to run kmedoids:

julia> kmedoids(mat, 8)
ERROR: AssertionError: !(isempty(grp))
 in _find_medoid at /root/.julia/v0.4/Clustering/src/kmedoids.jl:189
 in _kmedoids! at /root/.julia/v0.4/Clustering/src/kmedoids.jl:100
 in kmedoids at /root/.julia/v0.4/Clustering/src/kmedoids.jl:39

mat is a distance matrix.


johnmyleswhite commented on August 23, 2024

I think @lendle may be the main person who knows about this code.


lendle commented on August 23, 2024

The k-medoids code was rewritten by @cyocum in #22.


lendle commented on August 23, 2024

I just looked at the implementation and docs. C is actually an n x n matrix (not k x n).

The docs say "C – The cost matrix, where C[i,j] is the cost of assigning sample j to the medoid i", which is a bit unclear. The ith row does not correspond to the ith medoid for i = 1, ..., k; instead, it holds the cost of assigning each sample to a medoid defined by sample i, for i = 1, ..., n.
@waTeim, would "C – The cost matrix, where C[i,j] is the cost of assigning sample j to a cluster with medoid sample i" be clearer?
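
To make that reading concrete, here is a minimal sketch in plain Julia. The toy data, the absolute-difference cost, and the variable names are illustrative assumptions, not taken from the package source.

```julia
# Hypothetical toy data: 5 one-dimensional samples.
x = [0.0, 1.0, 2.0, 10.0, 11.0]
n = length(x)

# n x n cost matrix: C[i, j] is the cost of assigning sample j
# to a cluster whose medoid is sample i.
C = [abs(x[i] - x[j]) for i in 1:n, j in 1:n]

# Suppose samples 2 and 5 are the current medoids (k = 2).
medoids = [2, 5]

# Each sample joins the medoid with the lowest cost. Only the rows
# C[medoids, :] are consulted, which is the k x n slice of C.
assignments = [argmin(C[medoids, j]) for j in 1:n]

# Total cost of this configuration.
totalcost = sum(C[medoids[assignments[j]], j] for j in 1:n)

println(assignments)
println(totalcost)
```

Because any sample can become a medoid, all n rows have to be available up front. With the package loaded, a matrix like C is what would be passed in a call of the form kmedoids(C, 2), following the call shown earlier in this thread.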


waTeim commented on August 23, 2024

I don't think so. Are you sure? That approach has a lot of problems. It's not really the algorithm as documented on Wikipedia; the matrix should be k x n and be recalculated every iteration because the medoids can change. If it's n x n, it becomes unusable. Take, for instance, how I was going to use it: 90,000 data points give rise to a 90000 x 90000 matrix, which uses more memory than the machine I have to run it on (40 GB RAM), and frankly that's kind of small compared to today's dataset sizes.
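
A back-of-the-envelope check of those sizes, as a sketch. It assumes Float64 entries (8 bytes each), and k = 8 is a hypothetical cluster count used only for comparison.

```julia
n = 90_000   # number of samples, from the comment above
k = 8        # hypothetical number of clusters (an assumption)

# Dense n x n matrix of Float64 values, in decimal gigabytes.
full_gb = n^2 * sizeof(Float64) / 1e9

# k x n matrix of Float64 values, in decimal megabytes.
kxn_mb = k * n * sizeof(Float64) / 1e6

println("n x n: ", full_gb, " GB")
println("k x n: ", kxn_mb, " MB")
```

Under these assumptions the n x n matrix needs about 64.8 GB, well above the 40 GB of RAM mentioned, while a k x n matrix would need only a few megabytes.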


lindahua commented on August 23, 2024

The current approach takes as input a pre-computed pairwise cost matrix. When n is very large, one should use a different algorithm.


waTeim commented on August 23, 2024

Yeah, that's more efficient if rows^2 entries are storable, but in this case, nope. And like I said, this isn't even the largest of the datasets. I was looking around for a simple implementation of PAM to submit as a PR so that there would be options, but I am stuck: the "compare the cost of each non-medoid with a medoid's cost and swap if lower" step seems too inefficient to implement directly like that.
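
For reference, a direct version of that swap step might look like the sketch below. This is a naive illustration, not the package's implementation; the function names totalcost and swap_pass and the toy data are hypothetical.

```julia
# Total cost: each sample is charged its distance to the nearest medoid.
totalcost(D, medoids) =
    sum(minimum(D[m, j] for m in medoids) for j in axes(D, 2))

# One naive PAM swap pass: try every (medoid, non-medoid) pair and
# keep the single swap that lowers the total cost the most.
function swap_pass(D, medoids)
    best, bestcost = copy(medoids), totalcost(D, medoids)
    for pos in eachindex(medoids), h in axes(D, 1)
        h in medoids && continue      # h must be a non-medoid
        trial = copy(medoids)
        trial[pos] = h                # swap medoid at pos for sample h
        c = totalcost(D, trial)
        if c < bestcost
            best, bestcost = trial, c
        end
    end
    return best, bestcost
end

# Toy usage: 5 one-dimensional samples, poor initial medoids.
x = [0.0, 1.0, 2.0, 10.0, 11.0]
D = [abs(a - b) for a in x, b in x]
medoids, cost = swap_pass(D, [1, 2])
println(medoids, " ", cost)
```

Each pass evaluates k(n - k) candidate swaps and recomputes the full cost in O(nk) for every one of them, which is where the inefficiency comes from. The pass would be repeated until no swap lowers the cost. The lookups D[m, j] could in principle be replaced by distances computed on the fly, trading memory for time.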


diegozea commented on August 23, 2024

@waTeim If your matrix is symmetric, maybe https://github.com/diegozea/PairwiseListMatrices.jl can be useful to you.
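
The general idea behind storing a symmetric matrix compactly can be sketched in plain Julia as follows. This is not the PairwiseListMatrices.jl API; packedlength, packedindex, and dist are hypothetical helpers, and a zero diagonal is assumed.

```julia
# Number of stored entries for n samples (strict upper triangle only).
packedlength(n) = n * (n - 1) ÷ 2

# Position of the pair (i, j), i != j, in a column-major packed
# strict upper triangle.
function packedindex(i, j)
    i == j && throw(ArgumentError("diagonal is not stored"))
    lo, hi = minmax(i, j)
    return (hi - 1) * (hi - 2) ÷ 2 + lo
end

# Toy usage: fill the packed vector for 4 one-dimensional samples.
x = [0.0, 1.0, 2.0, 10.0]
n = length(x)
d = zeros(packedlength(n))
for j in 2:n, i in 1:j-1
    d[packedindex(i, j)] = abs(x[i] - x[j])
end

# Symmetric lookup with an implicit zero diagonal.
dist(i, j) = i == j ? 0.0 : d[packedindex(i, j)]

println(dist(4, 1))

# Storage needed for n = 90_000 Float64 entries, in decimal gigabytes.
println(packedlength(90_000) * sizeof(Float64) / 1e9)
```

Packing roughly halves the storage, so for 90,000 samples with Float64 entries it comes to about 32.4 GB instead of 64.8 GB.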


alyst commented on August 23, 2024

This is an old question that does not apply to the new distance-based API.

