
hierarchical_kmeans

A package for building a hierarchical k-means / k-means tree / vocabulary tree.

Hierarchical k-means

An implementation of the k-means tree, a form of hierarchical k-means.

As is well known, running k-means on a large dataset is very time consuming. There are many approximation algorithms; hierarchical k-means is one of them.

This code implements a tree-structured k-means. Here is an example:

Assume we are building a two-level k-means tree: the first level is a 20-class k-means, and for each second-level branch a 10-class k-means is run. In total, this yields a k-means with 20*10 = 200 classes. When quantizing a data point, we first quantize at the first level, i.e., assign the point to one of the 20 classes; after that, we quantize at the second level within the nearest branch.
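The two-level scheme above can be sketched directly with scipy's kmeans and vq (a conceptual illustration, not this package's API; the helper quantize and the 4/3 branching are hypothetical, chosen small to keep the sketch readable):

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

rng = np.random.default_rng(0)
data = rng.standard_normal((1000, 8))

# Level 1: a coarse 4-class k-means over all points.
level1_centroids, _ = kmeans(data, 4)
level1_codes, _ = vq(data, level1_centroids)

# Level 2: an independent 3-class k-means inside each coarse cell.
level2_centroids = {}
for c in range(len(level1_centroids)):
    members = data[level1_codes == c]
    if len(members) >= 3:  # guard against (near-)empty branches
        level2_centroids[c], _ = kmeans(members, 3)

def quantize(point):
    """Quantize one point: nearest level-1 centroid, then nearest
    level-2 centroid within that branch."""
    c1 = int(vq(point[None, :], level1_centroids)[0][0])
    c2 = int(vq(point[None, :], level2_centroids[c1])[0][0])
    return c1, c2
```

Each level-2 k-means only sees the points of its branch, which is what makes the scheme cheaper than one flat 12-class k-means on large data.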

Requirements

This implementation is based on the k-means implementation from the scipy package. Make sure you have scipy installed.

Usage

To run hierarchical k-means, we first determine the structure of the tree. Here is an example:

from hierarchical_kmeans import tree

# Build a depth-3 structure: 100 level-1 branches, 50 level-2
# branches under each, and 20 level-3 branches under each of those.
cluster_structure = tree('my_cluster_structure')
for i in range(100):
    x = tree('l1-c%d' % (i+1))
    cluster_structure.add_child(x)
    for j in range(50):
        y = tree('l2-c%d-c%d' % (i+1, j+1))
        x.add_child(y)
        for k in range(20):
            z = tree('l3-c%d-c%d-c%d' % (i+1, j+1, k+1))
            y.add_child(z)

This code creates a tree with 3 levels. When passed to the hierarchical k-means constructor, it tells the constructor to build a k-means tree of depth 3: at the first level, a 100-class k-means is run; for each of the 100 branches at level 2, a 50-class k-means; and for each branch at level 3, a 20-class k-means. Note that even if you request a 100-class clustering, k-means will not always produce exactly 100 classes (some classes may end up empty), so the example above will generally yield fewer than 100*50*20 classes in total, but this is not a problem.
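The nested loops above can be folded into a recursive helper. This is a hypothetical convenience, not part of the package; Node below is a minimal stand-in for the package's tree class so the sketch runs on its own (with the real package you would construct tree objects instead):

```python
class Node:
    """Minimal stand-in for hierarchical_kmeans.tree."""
    def __init__(self, name):
        self.name = name
        self.children = []

    def add_child(self, child):
        self.children.append(child)

def build_uniform_tree(name, branching):
    """branching=[100, 50, 20] reproduces the 3-level example above."""
    root = Node(name)
    if branching:
        k, rest = branching[0], branching[1:]
        for i in range(k):
            root.add_child(build_uniform_tree('%s-c%d' % (name, i + 1), rest))
    return root

structure = build_uniform_tree('my_cluster_structure', [100, 50, 20])
```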

After creating the cluster structure, running hierarchical k-means is simple:

from hierarchical_kmeans import hierarchical_kmeans as hkmeans
hk = hkmeans(cluster_structure)
hk.cluster(your_row_base_data, only_store_id=True, iteration=1)

The hk.cluster function accepts three arguments. The first is your data (row-based). The second, only_store_id, tells the algorithm whether to store the data ids (only_store_id=True, the default, which saves memory) or the data itself (only_store_id=False) on the leaf nodes; either way, the result is stored in the additional_data attribute of the leaf nodes. The third, iteration, is inherited from the kmeans function in scipy.cluster.vq: when iteration > 1, each branch keeps the best k-means result (the one with the lowest distortion) out of iteration runs.
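What iteration does can be sketched with scipy directly: run k-means several times and keep the codebook with the lowest distortion (scipy's own kmeans exposes the same behavior through its iter parameter; the explicit loop here is only for illustration):

```python
import numpy as np
from scipy.cluster.vq import kmeans

rng = np.random.default_rng(1)
data = rng.standard_normal((500, 4))

# iteration = 5: run k-means five times, keep the lowest distortion.
best_codebook, best_distortion = None, np.inf
for _ in range(5):
    codebook, distortion = kmeans(data, 10, iter=1)
    if distortion < best_distortion:
        best_codebook, best_distortion = codebook, distortion
```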

For new data points, quantizing them to one of the classes is quick and easy:

codes = hk.find_cluster(your_data_points, max_depth=-1)

If your_data_points contains more than one point, i.e., is two-dimensional, find_cluster returns a list in which each element is the tree node corresponding to the respective point in your_data_points. For a single data point, it returns just one tree node. max_depth tells the algorithm to quantize to at most depth max_depth; the default is -1, meaning the data are quantized all the way to the leaf nodes. If max_depth is given, the algorithm returns the quantization results at that depth (non-leaf nodes if max_depth is smaller than the depth of the branch).
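Depth-limited quantization can be sketched as a simple descent over a centroid tree (a conceptual illustration, not this package's API; the descend function and the (centroids, children) node representation are hypothetical):

```python
import numpy as np
from scipy.cluster.vq import vq

def descend(point, centroids, children, max_depth=-1, depth=1):
    """Quantize `point` down a centroid tree. `children[i]` is either
    None (leaf) or a (centroids, children) pair for the subtree.
    Returns the list of codes chosen along the path, cut off when
    `max_depth` is reached."""
    code = int(vq(point[None, :], centroids)[0][0])
    child = children[code]
    if child is None or depth == max_depth:
        return [code]
    sub_centroids, sub_children = child
    return [code] + descend(point, sub_centroids, sub_children,
                            max_depth, depth + 1)

# A tiny hand-built two-level tree in 1-D.
level2_neg = (np.array([[-3.0], [-1.0]]), [None, None])
level2_pos = (np.array([[1.0], [3.0]]), [None, None])
root_centroids = np.array([[-2.0], [2.0]])
root_children = [level2_neg, level2_pos]

full = descend(np.array([2.8]), root_centroids, root_children)       # [1, 1]
shallow = descend(np.array([2.8]), root_centroids, root_children,
                  max_depth=1)                                       # [1]
```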

To get the training data ids of a specific leaf node (for example, to find k nearest neighbors), try:

leaf_node.get_additional_data()

where leaf_node can be a leaf node returned by hk.find_cluster.
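The k-nearest-neighbor use case then reduces to a brute-force search over only the ids stored at the query's leaf. A minimal sketch, assuming the ids come from get_additional_data() (the helper knn_in_leaf is hypothetical):

```python
import numpy as np

def knn_in_leaf(query, train_data, leaf_ids, k=3):
    """Return the `k` ids from `leaf_ids` whose training points are
    nearest to `query` (Euclidean distance)."""
    leaf_ids = np.asarray(leaf_ids)
    dists = np.linalg.norm(train_data[leaf_ids] - query, axis=1)
    return leaf_ids[np.argsort(dists)[:k]].tolist()

# Toy example: ids 0..2 live in the query's leaf; id 3 does not.
train_data = np.array([[0.0], [1.0], [2.0], [10.0]])
neighbors = knn_in_leaf(np.array([1.2]), train_data, [0, 1, 2], k=2)  # [1, 2]
```

This is approximate k-NN: neighbors falling in a different leaf are never considered, which is the usual accuracy/speed trade-off of tree-based search.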

For a non-leaf node, you can gather all the data ids (or data) in its branch into output with:

output = []
non_leaf_node.gather_data_from_leaves(output, gather_additional_data=True)

If gather_additional_data=False, it gathers data instead of additional_data (in this tutorial, the centroids of the classes).
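Conceptually, gathering is a depth-first walk that collects the leaf payloads. A sketch with dict-based stand-in nodes (not the package's class):

```python
def gather_from_leaves(node, output, gather_additional_data=True):
    """Depth-first collection of leaf payloads into `output`."""
    if not node['children']:  # leaf
        key = 'additional_data' if gather_additional_data else 'data'
        output.extend(node[key])
        return
    for child in node['children']:
        gather_from_leaves(child, output, gather_additional_data)

leaf1 = {'children': [], 'additional_data': [0, 1], 'data': []}
leaf2 = {'children': [], 'additional_data': [5], 'data': []}
root = {'children': [leaf1, leaf2]}

out = []
gather_from_leaves(root, out)  # out == [0, 1, 5]
```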

Another use of this code is building a vocabulary tree, i.e., a bag-of-words (BoW) model. You can try it yourself.

hierarchical_kmeans's People

Contributors: github-pengge
