Giter Site home page Giter Site logo

salar96 / mep-orthogonal-nmf Goto Github PK

View Code? Open in Web Editor NEW
7.0 1.0 2.0 205 KB

Clustering and resource allocation using Deterministic Annealing Approach and Orthogonal Non-negative Matrix Factorization O-(NMF)

License: MIT License

Python 100.00%
clustering clustering-algorithm clustering-analysis nmf nmf-decomposition orthogonal outlier-detection onmf deterministic-annealing anomaly-detection resource-allocation constrained-optimization data-analysis matrix-factorization nonnegative-matrix-factorization nonnegativity-constraints sparse-representations

mep-orthogonal-nmf's Introduction

MEP-ONMF

Maximum-Entropy-Principle Orthogonal Non-negative Matrix Factorization

Deterministic Annealing Clustering (Maximum-Entropy-Principle Clustering)

Deterministic Annealing (DA) is a clustering, or in a more accurate definition, a facility location algorithm that clusters data points into several groups of different sizes, so that the cumulative distance of each data point to its assigned resource is minimized (for details on this algorithm, please refere to our paper mentioned above). To cluster data into $k$ clusters, just use the code snippet here:

parameter description
k number of clusters we want to partition our data in
tol The tolerence at which bifurcation happens for the critical cluster - smaller values more stable, but results in a slower performance
max_iter maximum number of iterations that the code waits until the convergence of parameters
alpha the value at which $\beta$ is multiplied by at the end of each convergence. Smaller values results in a more stable performance, but slows down the code
purturb the rate at which we purturb new cluster centers
beta_final the final $\beta$ value that we want to stop the algorithm at. If not specified, it is determined automatically
verbos whether to print out iterations' information
normalize whether to 'l2' normalize the input data
input description
X Data matrix $X \in \mathbb{R}^{d \times n}$ where $d$ is each data point's dimension and $n$ is the number of data points
output description
Y cluster centers (resource locations) $Y \in \mathbb{R}^{d \times k}$
P associations matrix $P \in \mathbb{R}^{k \times n}$ determining the belonging of each data point to each cluster center (resource)
{ 
  from Deterministic_Annealing import DA
  model=DA(K,tol=1e-4,max_iter=1000,alpha=1.05,
              purturb=0.01,beta_final=None,verbos=0,normalize=False)
  model.fit(X,Px='auto')
  Y,P=model.cluster()
  print(model.return_cost())
}

For example, for a dataset that consists of 16 psudu-symmetric clusters as illustrated if Fig 1, the DA finds the cluster centers:

{ 
  model.plot()
}

Where in this animation below, you can see how by increasing the value of $\beta$ these cluster centers split. In the beginning, there is only one cluster center, and as the value of $\beta$ reaches a critical value for a cluster center, that cluster center will bifurcate, Here animation_frame is equivalent to $\beta$ values.

{ 
  model.animation()
}
Clustering.mp4

These critical $\beta$ values provide useful information about our dataset. By plotting the log of these values using the command below, we can determine the true number of clusters in our dataset:

{ 
  model.plot_criticals(log=True)
}

index

By looking at this diagram, we can see that there are large gaps at 2,4,8 and 16 between these critical $\beta$ values. These show that, depending at the resolution you want to look at your dataset with, there are 2,4,8 or 16 clusters there. However, the largest gap occurs at 16, so it means that there are 16 clusters in this dataset, that can be taken as the true number of clusters. To get an accurate plot, it is advised to use small values for alpha and purturb. Also using :

{ 
  model.return_true_number()
}

returns the value 16 for this dataset.

The command mesh_plot specifies which cluster center the data belong to. For example, for a given dataset of 15 clusters, we get:

{ 
  model.mesh_plot()
}

index

The Mass-Constrained DA allows capturing unbalanced datasets where one or more clusters have significantly lower number of data points compared with the other clusters. The weight of each cluster can be shown using the pie_chart command. This is very useful for outlier detection applications. For example, for a given dataset like:

index

{ 
  model.pie_chart()
}

index

The resulting pie chart indicates that the cluster Y_6 is an outlier.


Orthogonal Non-negative Matrix Factorization (ONMF)

The orthogonal NMF (ONMF) poses the following problem:

$$ \begin{align} \min \quad &D(X,WH) \\ \textrm{s.t.} \quad &W^{\intercal}W=I \quad \text{or} \quad HH^{\intercal}=I \\ & W_{ij},H_{ij} \geq 0 \quad \forall \ i,j \end{align} $$

DA can also be used to solve the ONMF problem.

input description
X Data matrix $X \in \mathbb{R}^{d \times n}_{+}$ where $d$ is each data point's dimension and $n$ is the number of data points
output description
W Features matrix $W \in \mathbb{R}^{d \times k}_{+}$
H Mixing matrix $H \in \mathbb{R}^{k \times n}_{+}$
model DA model associated with the data matrix. Can be used to identify the true number of features
{ 
  from onmf_DA import ONMF_DA
  W,H,model=ONMF_DA.func(X,k,alpha=alpha,purturb=p,tol=tol)
}

For $H$-orthogonal, put $X$ and for $W$-orthogonal, put $X^{\intercal}$ as input.

mep-orthogonal-nmf's People

Contributors

salar96 avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar

Forkers

muskap coganlab

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.