filiptirnanic96 / t-sne
Implementation of the t-SNE and Barnes-Hut-SNE algorithms, with a comparison against the scikit-learn implementation on sample datasets.


t-Distributed Stochastic Neighbor Embedding (t-SNE algorithm)

Table of contents

  1. t-SNE algorithm
  2. Barnes-Hut-SNE algorithm
  3. t-SNE and Barnes-Hut-SNE project structure
  4. Databases and results
  5. References

t-SNE algorithm

The t-SNE algorithm is a non-linear dimensionality reduction method which maps high-dimensional data X = {x1, x2, ..., xn} to low-dimensional data Y = {y1, y2, ..., yn} while preserving as much structure as possible. This is done by making the pairwise similarities pij between the high-dimensional points X as close as possible to the pairwise similarities qij between the low-dimensional points Y.
Pairwise similarities pij are defined through Gaussian conditional distributions, symmetrized over the pairs:

p_{j|i} = exp(-||x_i - x_j||^2 / (2 sigma_i^2)) / sum_{k != i} exp(-||x_i - x_k||^2 / (2 sigma_i^2)),    p_{ij} = (p_{j|i} + p_{i|j}) / (2N)
Each Gaussian variance sigma_i, centered at data point xi, is obtained by binary search so that the conditional distribution P_i reaches a predefined perplexity (effective number of neighborhood points):

Perp(P_i) = 2^H(P_i),    H(P_i) = -sum_j p_{j|i} log2 p_{j|i}
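The binary search for sigma_i can be sketched in NumPy as follows. This is a minimal sketch, not the repository's code; the function name find_sigma and the parameterisation through the precision beta = 1 / (2 sigma_i^2) are illustrative choices:

```python
import numpy as np

def find_sigma(distances_i, perplexity, tol=1e-5, max_iter=50):
    """Binary-search the Gaussian bandwidth for one point so that the
    entropy of p_{j|i} matches log(perplexity).

    distances_i: squared distances from point i to all other points.
    Returns the conditional probabilities p_{j|i} and the found precision."""
    target_entropy = np.log(perplexity)
    beta = 1.0                          # precision, beta = 1 / (2 * sigma_i^2)
    beta_min, beta_max = -np.inf, np.inf
    p = np.exp(-distances_i * beta)
    for _ in range(max_iter):
        # conditional probabilities p_{j|i} for the current beta
        p = np.exp(-distances_i * beta)
        sum_p = np.sum(p)
        # Shannon entropy H(P_i) of the conditional distribution
        entropy = np.log(sum_p) + beta * np.sum(distances_i * p) / sum_p
        p /= sum_p
        diff = entropy - target_entropy
        if abs(diff) < tol:
            break
        if diff > 0:                    # entropy too high -> narrow the Gaussian
            beta_min = beta
            beta = beta * 2.0 if beta_max == np.inf else (beta + beta_max) / 2.0
        else:                           # entropy too low -> widen the Gaussian
            beta_max = beta
            beta = beta / 2.0 if beta_min == -np.inf else (beta + beta_min) / 2.0
    return p, beta
```

The search exploits that entropy decreases monotonically in beta, so doubling/halving followed by bisection converges in a few dozen iterations.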
Pairwise similarities qij are defined with a Student t-distribution with one degree of freedom:

q_{ij} = (1 + ||y_i - y_j||^2)^(-1) / sum_{k != l} (1 + ||y_k - y_l||^2)^(-1)
The heavy tails of the Student t-distribution are used to overcome the "crowding problem". Similarities of points with themselves are set to zero, i.e. p_ii = q_ii = 0.
For the mapping to be successful we want the high-dimensional distribution P to match the low-dimensional distribution Q as closely as possible. Hence, the Kullback-Leibler divergence is used as the criterion function to be minimized:

C = KL(P || Q) = sum_{i != j} p_ij log(p_ij / q_ij)

The KL divergence is minimized using gradient descent with an adaptive learning rate and momentum.
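As an illustration of the criterion, the following minimal NumPy sketch (not the repository's code; the name kl_cost is assumed) computes Q from an embedding Y and evaluates KL(P || Q):

```python
import numpy as np

def kl_cost(P, Y):
    """KL(P || Q) for a given low-dimensional embedding Y.
    P: symmetric high-dimensional affinities with zero diagonal, summing to 1."""
    # pairwise squared distances in the embedding
    D = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    W = 1.0 / (1.0 + D)             # Student-t kernel, one degree of freedom
    np.fill_diagonal(W, 0.0)        # q_ii is defined as 0
    Q = W / np.sum(W)
    # only terms with p_ij > 0 contribute to the sum
    mask = P > 0
    return np.sum(P[mask] * np.log(P[mask] / np.maximum(Q[mask], 1e-12)))
```

By Gibbs' inequality the cost is non-negative and reaches zero exactly when Q equals P.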

Pseudo code of the algorithm can be found below:
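The optimization loop can also be sketched as minimal NumPy code. The function name tsne_exact, the momentum schedule (0.5 then 0.8 after iteration 250) and the gain constants follow the common t-SNE recipe rather than this repository:

```python
import numpy as np

def tsne_exact(P, n_iter=500, learning_rate=100.0, dims=2, seed=0):
    """Sketch of the exact t-SNE gradient-descent loop with momentum
    and per-parameter adaptive gains (delta-bar-delta style)."""
    n = P.shape[0]
    rng = np.random.RandomState(seed)
    Y = 1e-4 * rng.randn(n, dims)       # small random initial embedding
    velocity = np.zeros_like(Y)
    gains = np.ones_like(Y)
    for it in range(n_iter):
        momentum = 0.5 if it < 250 else 0.8
        # Student-t weights and normalised affinities in the embedding
        D = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
        W = 1.0 / (1.0 + D)
        np.fill_diagonal(W, 0.0)
        Q = W / np.sum(W)
        # gradient: dC/dy_i = 4 * sum_j (p_ij - q_ij) * w_ij * (y_i - y_j)
        PQW = (P - Q) * W
        grad = 4.0 * (np.diag(PQW.sum(axis=1)) - PQW) @ Y
        # adaptive gains: grow while descent direction is consistent,
        # shrink when gradient and velocity point the same way (oscillation)
        same_sign = np.sign(grad) == np.sign(velocity)
        gains = np.where(same_sign, gains * 0.8, gains + 0.2)
        gains = np.maximum(gains, 0.01)
        velocity = momentum * velocity - learning_rate * gains * grad
        Y += velocity
    return Y
```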

Barnes-Hut-SNE algorithm

The time and memory complexity of the t-SNE algorithm is O(N^2), where N is the number of data points, which is impractical for datasets with more than a few thousand points. Barnes-Hut-SNE is an approximation of t-SNE that requires O(N log N) time and memory and can therefore be used for large datasets. It relies on two approximations:
 1. Approximation of the input-data similarities pij
 2. Approximation of the t-SNE gradient calculation

The first approximation uses a k-dimensional (k-d) tree to find the first p = 3*perplexity neighbors of each input data point; pij is computed only over these neighbors and set to zero elsewhere. The complexity of this approach is O(N log N). An example k-d tree constructed on synthetic data is presented in the picture below:
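The idea behind the k-d tree query can be illustrated with a minimal NumPy sketch; this is not the repository's implementation (which also supports k-nearest queries), just a single-nearest-neighbour search with the usual branch pruning:

```python
import numpy as np

def build_kdtree(points, indices=None, depth=0):
    """Minimal k-d tree: cycle through axes, split at the median point."""
    if indices is None:
        indices = np.arange(len(points))
    if len(indices) == 0:
        return None
    axis = depth % points.shape[1]
    order = indices[np.argsort(points[indices, axis])]
    mid = len(order) // 2
    return {
        "index": order[mid],            # median point stored at this node
        "axis": axis,
        "left": build_kdtree(points, order[:mid], depth + 1),
        "right": build_kdtree(points, order[mid + 1:], depth + 1),
    }

def nearest(node, points, query, best=None):
    """Depth-first nearest-neighbour search; best = (squared_dist, index)."""
    if node is None:
        return best
    d = np.sum((points[node["index"]] - query) ** 2)
    if best is None or d < best[0]:
        best = (d, node["index"])
    diff = query[node["axis"]] - points[node["index"], node["axis"]]
    near, far = ("left", "right") if diff < 0 else ("right", "left")
    best = nearest(node[near], points, query, best)
    if diff ** 2 < best[0]:             # far branch may still hold a closer point
        best = nearest(node[far], points, query, best)
    return best
```

Pruning whole subtrees whenever the splitting plane is farther than the current best match is what brings the average query cost down to O(log N) per point.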


The t-SNE KL gradient is defined by the formula:

dC/dy_i = 4 sum_{j != i} (p_ij - q_ij) (1 + ||y_i - y_j||^2)^(-1) (y_i - y_j)
The KL gradient can be split into an attractive and a repulsive part:

dC/dy_i = 4 (F_attr - F_rep) = 4 (sum_{j != i} p_ij q_ij Z (y_i - y_j) - sum_{j != i} q_ij^2 Z (y_i - y_j)),    Z = sum_{k != l} (1 + ||y_k - y_l||^2)^(-1)
F_attr can be calculated in O(pN) time, since p_ij is non-zero only for the p nearest neighbors. F_rep naively requires O(N^2) time, which the Barnes-Hut approximation reduces to O(N log N). Barnes-Hut-SNE constructs a quad tree on the output (low-dimensional) data Y, and in each evaluation of F_rep it decides whether the current node can serve as a summary of the contribution to F_rep of all the data points inside it. An example quad tree constructed on synthetic data is presented in the picture below:
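The node-acceptance test can be sketched as follows. The cell layout (a dict with center of mass, point count and cell size) and the function name are illustrative assumptions; the condition size/distance < theta is the usual Barnes-Hut criterion:

```python
import numpy as np

def cell_contribution(y_i, cell, theta=0.5):
    """Decide whether a quad-tree cell may summarise its points for F_rep.

    cell: dict with 'center' (centre of mass of the points inside),
    'n_points' and 'size' (the cell's diagonal). Returns None when the
    cell is too close or too large and must be opened (recurse into
    children); otherwise returns the cell's unnormalised contribution
    (force term, normaliser term)."""
    diff = y_i - cell["center"]
    dist2 = np.sum(diff ** 2)
    if dist2 == 0.0 or cell["size"] ** 2 / dist2 >= theta ** 2:
        return None                      # open the cell and visit its children
    w = 1.0 / (1.0 + dist2)              # Student-t weight to the centre of mass
    # all n_points act as one point: contributes n * w^2 * (y_i - y_center)
    # to the unnormalised F_rep and n * w to the normaliser Z
    return cell["n_points"] * w ** 2 * diff, cell["n_points"] * w
```

Smaller theta opens more cells and is more accurate but slower; theta = 0 degenerates to the exact O(N^2) computation.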


t-SNE and Barnes-Hut-SNE project structure

The project structure is shown in the picture below. The module tsne_algorithm is the core module and contains the t-SNE and Barnes-Hut-SNE implementations, written using only NumPy. The tree structures (k-d tree and quad tree) are implemented in the module tsne_algorithm/trees. The module tsne_test_script contains scripts for testing the t-SNE implementation and visualising the results. The folder dataset contains the data used for testing; the test data is obtained simply by unzipping the 7z files.


Databases and results

Two databases are used to test the t-SNE and Barnes-Hut-SNE implementations:
 1. MNIST - 6000 images of digits 0-9 at resolution 28x28. Sample images can be found below.


 2. Olivetti faces - 400 images of 40 different people at resolution 64x64. Sample images can be found below.


Results of the t-SNE "exact" method on 4000 samples of the MNIST dataset


Results of the Barnes-Hut-SNE method on 1000 samples of the MNIST dataset


Results of the t-SNE "exact" method on the Olivetti faces dataset


Results of the Barnes-Hut-SNE method on the Olivetti faces dataset


References

References used for the t-SNE implementation:
t-SNE algorithm:
https://distill.pub/2016/misread-tsne/
https://towardsdatascience.com/t-sne-clearly-explained-d84c537f53a
https://jeremy9959.net/Blog/tsne_annotated-fixed/
http://jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf
Adaptive learning rate update:
https://web.cs.umass.edu/publication/docs/1987/UM-CS-1987-117.pdf
K-d tree:
https://upcommons.upc.edu/bitstream/handle/2117/76382/hpgm_15_1.pdf
https://arxiv.org/pdf/cs/9901013.pdf
https://en.wikipedia.org/wiki/K-d_tree
Neighbors Heap:
https://www.hackerearth.com/practice/notes/heaps-and-priority-queues/
Quad tree:
https://en.wikipedia.org/wiki/Quadtree
Barnes-Hut approximation in t-SNE:
https://arxiv.org/pdf/1301.3342.pdf
https://jheer.github.io/barnes-hut/
