
MatrixFact

Install

pip install matrix-fact

What is matrix-fact?

matrix-fact contains modules for constrained and unconstrained matrix factorization (and related) methods, for both sparse and dense matrices. The repository can be found at https://github.com/gentaiscool/matrix_fact. The code is based on https://github.com/ChrisSchinnerl/pymf3 and https://github.com/rikkhill/pymf, updated to support recent versions of its dependencies. It requires cvxopt, numpy, scipy, and torch. We recently added support for a PyTorch-based SNMF.

Packages

The package includes:

  • Non-negative matrix factorization (NMF) [three different optimization schemes]
  • Convex non-negative matrix factorization (CNMF)
  • Semi non-negative matrix factorization (SNMF)
  • Archetypal analysis (AA)
  • Simplex volume maximization (SiVM) [and SiVM for CUR, GSAT, ... ]
  • Convex-hull non-negative matrix factorization (CHNMF)
  • Binary matrix factorization (BNMF)
  • Singular value decomposition (SVD)
  • Principal component analysis (PCA)
  • K-means clustering (Kmeans)
  • C-means clustering (Cmeans)
  • CUR decomposition (CUR)
  • Compact matrix decomposition (CMD)
  • PyTorch SNMF

Usage

Given a dataset, most factorization methods try to minimize the Frobenius norm ||data - W*H|| by finding a suitable set of basis vectors W and coefficients H. The syntax for calling the various methods is quite similar. Usually, one has to supply the desired number of basis vectors and the maximum number of iterations. For example, applying NMF to a dataset data, aiming at 2 basis vectors within 10 iterations, works as follows:

>>> import matrix_fact
>>> import numpy as np
>>> data = np.array([[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]])
>>> nmf_mdl = matrix_fact.NMF(data, num_bases=2, niter=10)
>>> nmf_mdl.initialization()
>>> nmf_mdl.factorize()
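
To verify the fit (a minimal sketch, assuming W and H come back as plain numpy arrays), reconstruct the data and measure the residual's Frobenius norm:

>>> rec = np.dot(nmf_mdl.W, nmf_mdl.H)
>>> print(np.linalg.norm(data - rec))  # Frobenius norm of data - W*H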

The basis vectors are now stored in nmf_mdl.W and the coefficients in nmf_mdl.H. To compute coefficients for an existing set of basis vectors, simply copy W to nmf_mdl.W and set compW to False:

>>> data = np.array([[1.5], [1.2]])
>>> W = np.array([[1.0, 0.0], [0.0, 1.0]])
>>> nmf_mdl = matrix_fact.NMF(data, num_bases=2, niter=1, compW=False)
>>> nmf_mdl.initialization()
>>> nmf_mdl.W = W
>>> nmf_mdl.factorize()
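
Since W is fixed here, only the coefficients H are updated. A quick sanity check (again assuming numpy arrays) is to compare the reconstruction against the data:

>>> print(np.dot(nmf_mdl.W, nmf_mdl.H))  # should be close to data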

By changing matrix_fact.NMF to e.g. matrix_fact.AA or matrix_fact.CNMF, Archetypal Analysis or Convex-NMF can be applied instead. Some methods accept additional parameters, so have a look at the corresponding documentation, e.g. help(matrix_fact.AA). CUR, CMD, and SVD are handled slightly differently, as they factorize into three submatrices, which requires appropriate arguments for row and column sampling; see the sketch below.
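For instance, a minimal SVD sketch (the factor attribute names U, S, and V follow the original pymf convention and are an assumption here):

>>> svd_mdl = matrix_fact.SVD(data)
>>> svd_mdl.factorize()
>>> # three factors such that data is approximately U * S * V
>>> print(svd_mdl.U, svd_mdl.S, svd_mdl.V)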

For PyTorch-SNMF

>>> import torch
>>> import matrix_fact
>>> data = torch.FloatTensor([[1.5], [1.2]])
>>> nmf_mdl = matrix_fact.NMF(data, num_bases=2)
>>> nmf_mdl.factorize(niter=1000)

Very large datasets

For handling larger datasets, matrix_fact supports hdf5 via h5py. Usage is straightforward, as h5py allows large numpy matrices to be mapped to disk. Thus, instead of passing data as a np.array, you can simply pass the corresponding hdf5 table. The following example shows how to apply matrix_fact to a random matrix that is stored entirely on disk. In this case the dataset does not have to fit into memory; only the resulting low-rank factors W and H do.

>>> import h5py
>>> import numpy as np
>>> import matrix_fact
>>>
>>> file = h5py.File('myfile.hdf5', 'w')
>>> file['dataset'] = np.random.random((100,1000))
>>> sivm_mdl = matrix_fact.SIVM(file['dataset'], num_bases=10)
>>> sivm_mdl.factorize()
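
Here only the input is disk-backed; the factors are held in memory as in the earlier examples (assuming they are exposed as numpy arrays):

>>> print(sivm_mdl.W.shape, sivm_mdl.H.shape)  # e.g. (100, 10) and (10, 1000)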

If the low-rank matrices W and H also do not fit into memory, they can be initialized as h5py datasets:

>>> import h5py
>>> import numpy as np
>>> import matrix_fact
>>>
>>> file = h5py.File('myfile.hdf5', 'w')
>>> file['dataset'] = np.random.random((100,1000))
>>> file['W'] = np.random.random((100,10))
>>> file['H'] = np.random.random((10,1000))
>>> sivm_mdl = matrix_fact.SIVM(file['dataset'], num_bases=10)
>>> sivm_mdl.W = file['W']
>>> sivm_mdl.H = file['H']
>>> sivm_mdl.factorize()
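
In this variant the factors are written into the hdf5 file itself, so standard h5py usage applies for reading them back and releasing the file handle:

>>> print(file['W'].shape, file['H'].shape)  # (100, 10) and (10, 1000)
>>> file.close()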

Please note that not all methods currently work well with hdf5. While they all accept hdf5 input matrices, some lead to very high memory consumption in intermediate computation steps. This is difficult to avoid unless we switch to completely disk-based storage.
