fastclustering

Efficient clustering

This package implements standard clustering algorithms, with a bias towards optimal transport estimation. They are written in C++ with Eigen and wrapped in Python with pybind11. Useful things you can find here are:

  • Clustering:
    • KMeans++ algorithm
    • A Vose Alias sampler for efficient sampling of discrete distributions.
    • AFKMC^2 algorithm, which uses the aforementioned sampler. Since the sampler gives O(1) draws after an O(n) preprocessing step, the resulting algorithm has the expected complexity.
  • Regularized transport:
    • Sinkhorn algorithm
    • Greenhorn algorithm, with an emphasis on stable computation: in particular, I avoid renormalizing P at each step, which is an O(n^2) operation.
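
To make the sampler item above more concrete, here is a minimal pure-Python sketch of the Vose alias method. It is not the C++ code of this package, and the names build_alias_table and sample are hypothetical, chosen only for this illustration. The idea is to spend O(n) once to build two tables, after which each draw only needs one uniform index and one biased coin flip.

```python
import random


def build_alias_table(weights):
    """Preprocess non-negative weights into (prob, alias) tables in O(n).

    Hypothetical helper for illustration; not part of this package's API.
    """
    n = len(weights)
    total = float(sum(weights))
    if n == 0 or total <= 0:
        raise ValueError("weights must be non-empty with a positive sum")
    # Rescale so that the average entry is exactly 1.
    scaled = [w * n / total for w in weights]
    prob = [0.0] * n
    alias = [0] * n
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s = small.pop()
        l = large.pop()
        prob[s] = scaled[s]
        alias[s] = l
        # Column l donates mass to fill column s up to height 1.
        scaled[l] = (scaled[l] + scaled[s]) - 1.0
        (small if scaled[l] < 1.0 else large).append(l)
    # Whatever is left has height 1 up to floating-point error.
    for i in large + small:
        prob[i] = 1.0
        alias[i] = i
    return prob, alias


def sample(prob, alias, rng=random):
    """Draw one index in O(1): pick a column, then flip a biased coin."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]
```

A typical use would be to call build_alias_table once on the proposal weights and then call sample repeatedly, for instance inside the Markov chain of a seeding algorithm such as AFKMC^2.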

Install

You need the following two libraries in your include path: Eigen and pybind11.

Those are header-only libraries. You only need to download them and add them to your C++ include path. E.g., extract them in /usr/local/include and add the following to your .bashrc:

# Show eigen library
export CPLUS_INCLUDE_PATH="$CPLUS_INCLUDE_PATH:/usr/local/include/eigen3"

# Show PyBind library
export CPLUS_INCLUDE_PATH="$CPLUS_INCLUDE_PATH:/usr/local/include/pybind11/include"

Otherwise, you can simply include the directories directly in the CMakeLists.txt file.

I guess a more standard way is to include the libraries directly in the repo, as I did with Google Test. This is in the TODOs.

Then, simply run pip install . after activating the appropriate environment if you use conda or pyenv.

Speed comparison

For the Vose Alias sampling:

[Figure: vosealias_vs_numpy]

For the KMeans++ implementation:

[Figure: kmeanspp_cpp_vs_sklearn]

Disclaimers

This was my first experience with C++, CMake, Pybind, Eigen, and testing. Obviously, there are likely horrible things lying here and there; if you find mistakes, please consider making a PR!

Otherwise, this was mainly a way to get acquainted with the tools mentioned above. It is not intended to be published on PyPI.

TODO

Install

For now, I set the CMake build type manually in pybind.cpp: set(CMAKE_BUILD_TYPE Release). This can cause problems when building with CMake (make sure CMake is called with the same build type as the one set in pybind.cpp!). I did this to ensure the setup.py file uses the Release configuration. There is certainly a way to avoid that.

AFKMC^2

  • Add the possibility to pass an array of random numbers: this is the bottleneck in terms of speed.

Clustering

  • Add tests for the clustering algorithms.

Khorn

  • Add the storage order as a template parameter, to allow for matrices stored in column-major.

Python interface

  • Make a better Python interface in the Python package.

Builds

Clean up the external libraries once and for all. For now, they are installed locally and added to the C++ include path. Fix:

  • GoogleTest: make it download in CMake
  • Eigen, Pybind: put them in the lib.
