Python Distributed Non Negative Matrix Factorization with custom clustering

Home Page: https://lanl.github.io/pyDNMFk/

License: BSD 3-Clause "New" or "Revised" License

pydnmfk's Introduction


pyDNMFk is a software package for applying non-negative matrix factorization in a distributed fashion to large datasets. It can minimize the difference between the reconstructed data and the original data under different objectives (Frobenius norm, KL divergence). Additionally, the Custom Clustering algorithm allows automated determination of the number of latent features.
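To make the two objectives concrete, here is a minimal single-process NumPy sketch of the quantities being minimized. The function names `frobenius_error` and `kl_divergence` are illustrative and are not part of the pyDNMFk API; the package itself evaluates these in a distributed way.

```python
import numpy as np

def frobenius_error(X, W, H):
    """Squared Frobenius norm of the residual, ||X - WH||_F^2."""
    return float(np.sum((X - W @ H) ** 2))

def kl_divergence(X, W, H, eps=1e-12):
    """Generalized KL divergence D(X || WH) for non-negative X."""
    WH = W @ H + eps
    # Entries with X == 0 contribute only WH, so the log term is masked out there.
    log_term = np.where(X > 0, X * np.log((X + eps) / WH), 0.0)
    return float(np.sum(log_term - X + WH))

rng = np.random.default_rng(0)
W = rng.random((6, 2))
H = rng.random((2, 5))
X = W @ H  # exact rank-2 data, so both objectives are (numerically) zero

print(frobenius_error(X, W, H), kl_divergence(X, W, H))
```

Both values grow as `W @ H` moves away from `X`; which one is appropriate depends on the noise model assumed for the data.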


Features:

  • Uses MPI4py for distributed operation.
  • Distributed NNSVD and SVD initializations.
  • Distributed Custom Clustering algorithm for automated determination of the number of latent features (k).
  • Minimization of either the KL divergence or the Frobenius norm.
  • Optimization with multiplicative updates (MU), BCD, and HALS.
  • Checkpoints that track runtime status and enable restart from a saved state.
  • Distributed pruning of zero rows and zero columns of the data.
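As a point of reference for the multiplicative-update (MU) option, the sketch below shows the textbook Lee and Seung updates for the Frobenius objective on a single process. It is not pyDNMFk's distributed implementation; `nmf_mu_frobenius` and its parameters are hypothetical names chosen for this illustration.

```python
import numpy as np

def nmf_mu_frobenius(X, k, itr=200, eps=1e-9, seed=0):
    """Single-process NMF with multiplicative updates for ||X - WH||_F^2."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, k))  # random non-negative start (not NNSVD)
    H = rng.random((k, n))
    for _ in range(itr):
        # Element-wise updates keep W and H non-negative by construction.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ (H @ H.T) + eps)
    return W, H

rng = np.random.default_rng(1)
X = rng.random((20, 3)) @ rng.random((3, 15))  # synthetic rank-3 data
W, H = nmf_mu_frobenius(X, k=3)
rel_err = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
print("relative reconstruction error:", rel_err)
```

The small `eps` in the denominators only guards against division by zero; it is a choice made for this sketch.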

Overview of the pyDNMFk workflow implementation.

Installation:

On a desktop machine:

git clone https://github.com/lanl/pyDNMFk.git
cd pyDNMFk
conda create --name pyDNMFk python=3.7.1 openmpi mpi4py
source activate pyDNMFk
python setup.py install

On an HPC server:

git clone https://github.com/lanl/pyDNMFk.git
cd pyDNMFk
conda create --name pyDNMFk python=3.7.1 
source activate pyDNMFk
module load <openmpi>
pip install mpi4py
python setup.py install

Prerequisites

  • conda
  • numpy>=1.2
  • matplotlib
  • MPI4py
  • scipy
  • h5py

Documentation

You can find the documentation at https://lanl.github.io/pyDNMFk/.

Usage

main.py can be used to run the software from the command line:

mpirun -n <procs> python main.py [-h] [--process PROCESS] --p_r P_R --p_c P_C [--k K]
               [--fpath FPATH] [--ftype FTYPE] [--fname FNAME] [--init INIT]
               [--itr ITR] [--norm NORM] [--method METHOD] [--verbose VERBOSE]
               [--results_path RESULTS_PATH] [--checkpoint CHECKPOINT]
               [--timing_stats TIMING_STATS] [--prune PRUNE]
               [--precision PRECISION] [--perturbations PERTURBATIONS]
               [--noise_var NOISE_VAR] [--start_k START_K] [--end_k END_K]
               [--step_k STEP_K] [--sill_thr SILL_THR] [--sampling SAMPLING]


arguments:
  -h, --help            show this help message and exit
  --process PROCESS     pyDNMF/pyDNMFk
  --p_r P_R             Number of row processors
  --p_c P_C             Number of column processors
  --k K                 feature count
  --fpath FPATH         data path to read(eg: tmp/)
  --ftype FTYPE         data type : mat/folder/h5
  --fname FNAME         File name
  --init INIT           NMF initializations: rand/nnsvd
  --itr ITR             NMF iterations, default:1000
  --norm NORM           Reconstruction Norm for NMF to optimize:KL/FRO
  --method METHOD       NMF update method:MU/BCD/HALS
  --verbose VERBOSE
  --results_path RESULTS_PATH
                        Path for saving results
  --checkpoint CHECKPOINT
                        Enable checkpoint to track the pyNMFk state
  --timing_stats TIMING_STATS
                        Switch to turn on/off benchmarking.
  --prune PRUNE         Prune zero row/column.
  --precision PRECISION
                        Precision of the data(float32/float64/float16).
  --perturbations PERTURBATIONS
                        perturbation for NMFk
  --noise_var NOISE_VAR
                        Noise variance for NMFk
  --start_k START_K     Start index of K for NMFk
  --end_k END_K         End index of K for NMFk
  --step_k STEP_K       step for K search
  --sill_thr SILL_THR   Silhouette threshold for K estimation
  --sampling SAMPLING   Sampling noise for NMFk, i.e. uniform/poisson

Example of running pyDNMFk using main.py:

mpirun -n 4 python main.py --p_r=4 --p_c=1 --process='pyDNMFk'  --fpath='data/' --ftype='mat' --fname='swim' --init='nnsvd' --itr=5000 --norm='kl' --method='mu' --results_path='results/' --perturbations=20 --noise_var=0.015 --start_k=2 --end_k=5 --sill_thr=.9 --sampling='uniform'
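In the command above, 4 MPI processes are launched with a 4 x 1 processor grid (p_r=4, p_c=1), so the grid size matches the process count. The sketch below illustrates how a matrix could be split into contiguous blocks over such a grid. It assumes row-major rank ordering and near-equal block sizes; pyDNMFk's actual layout may differ, and `block_bounds` and `block_for_rank` are hypothetical helpers.

```python
def block_bounds(n, parts, idx):
    """Start/stop of the idx-th of `parts` near-equal contiguous chunks of range(n)."""
    base, extra = divmod(n, parts)
    start = idx * base + min(idx, extra)
    stop = start + base + (1 if idx < extra else 0)
    return start, stop

def block_for_rank(n_rows, n_cols, p_r, p_c, rank):
    """Row and column ranges owned by `rank` on a p_r x p_c grid (row-major, assumed)."""
    if not 0 <= rank < p_r * p_c:
        raise ValueError("rank must lie in [0, p_r * p_c)")
    i, j = divmod(rank, p_c)  # grid coordinates of this rank
    return block_bounds(n_rows, p_r, i), block_bounds(n_cols, p_c, j)

# A 10 x 6 matrix on the 4 x 1 grid used in the example command.
for rank in range(4):
    print(rank, block_for_rank(10, 6, p_r=4, p_c=1, rank=rank))
```

With p_c=1 every process holds all columns and a slice of the rows, which is the simplest layout to reason about when checking input sizes.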

Example estimation of k using the provided sample dataset:

'''Imports block'''
import numpy as np  # np is used below for args.precision
import pyDNMFk.config as config
config.init(0)
from pyDNMFk.pyDNMFk import *
from pyDNMFk.data_io import *
from pyDNMFk.dist_comm import *
from scipy.io import loadmat
from mpi4py import MPI
comm = MPI.COMM_WORLD
args = parse()  


'''parameters initialization block'''

# Data Read here
args.fpath = 'data/'
args.fname = 'wtsi'  
args.ftype = 'mat'
args.precision = np.float32

#Distributed Comm config block
p_r, p_c = 4, 1  

#NMF config block
args.norm = 'kl'
args.method = 'mu'
args.init = 'nnsvd'
args.itr = 5000
args.verbose = True

#Cluster config block
args.start_k = 2 
args.end_k = 5
args.sill_thr = 0.9

#Data Write
args.results_path = 'results/'


'''Parameters prep block'''
comms = MPI_comm(comm, p_r, p_c)
comm1 = comms.comm
rank = comm.rank
size = comm.size
args.size, args.rank, args.comm, args.p_r, args.p_c = size, rank, comms, p_r, p_c
args.row_comm, args.col_comm, args.comm1 = comms.cart_1d_row(), comms.cart_1d_column(), comm1
A_ij = data_read(args).read().astype(args.precision)

nopt = PyNMFk(A_ij, factors=None, params=args).fit()
print('Estimated k with NMFk is ',nopt)
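The `perturbations`, `noise_var` and `sampling` options control how NMFk builds an ensemble of resampled copies of the data before factorizing each one. The sketch below shows one plausible reading of the two sampling modes: multiplicative uniform noise in [1 - noise_var, 1 + noise_var], and Poisson resampling with the data as the mean. How pyDNMFk applies `noise_var` internally is an assumption here, and `perturb` is a hypothetical helper.

```python
import numpy as np

def perturb(X, noise_var=0.015, sampling="uniform", rng=None):
    """Return one randomly perturbed, still non-negative copy of X."""
    rng = np.random.default_rng() if rng is None else rng
    if sampling == "uniform":
        # Scale each entry by a factor close to 1 (assumed meaning of noise_var).
        factors = rng.uniform(1.0 - noise_var, 1.0 + noise_var, size=X.shape)
        return X * factors
    if sampling == "poisson":
        # Treat each entry as the mean of a Poisson draw.
        return rng.poisson(X).astype(X.dtype)
    raise ValueError("sampling must be 'uniform' or 'poisson'")

rng = np.random.default_rng(0)
X = rng.random((5, 4)) + 0.1
ensemble = [perturb(X, noise_var=0.015, rng=rng) for _ in range(20)]
print(len(ensemble), ensemble[0].shape)
```

Factors that are stable across such an ensemble are the ones a clustering step can group tightly, which is what the silhouette threshold then measures.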

Example of running pyDNMFk to get the W and H matrices:

# Save this example as code.py and run it with "mpirun -n 4 python code.py"
from pyDNMFk.runner import pyDNMFk_Runner
import numpy as np

runner = pyDNMFk_Runner(itr=100, init='nnsvd', verbose=True, 
                        norm='fro', method='mu', precision=np.float32,
                        checkpoint=False, sill_thr=0.6)

results = runner.run(grid=[4,1], fpath='data/', fname='wtsi', 
                     ftype='mat', results_path='results/',
                     k_range=[1,3], step_k=1)

W = results["W"]
H = results["H"]

See the examples or tests for more use cases.
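Both examples pass a `sill_thr` value. A common NMFk-style selection rule is to take the largest k whose minimum cluster silhouette stays at or above that threshold. The sketch below illustrates this rule only; whether pyDNMFk applies exactly this criterion is an assumption, and `select_k` and the silhouette values are made up for the illustration.

```python
def select_k(k_values, min_silhouettes, sill_thr=0.9):
    """Largest k whose minimum silhouette is >= sill_thr, or None if none qualifies."""
    candidates = [k for k, s in zip(k_values, min_silhouettes) if s >= sill_thr]
    return max(candidates) if candidates else None

# Hypothetical minimum silhouettes for k = 2..5.
ks = [2, 3, 4, 5]
sils = [0.99, 0.95, 0.97, 0.40]
print(select_k(ks, sils, sill_thr=0.9))
```

Lowering the threshold (for example to the 0.6 used in the runner example) accepts less stable clusterings and therefore tends to select a larger k.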


Benchmarking

Figure: Scaling benchmarks for 10 iterations of Frobenius-norm-based MU updates with MPI operations for (i) strong and (ii) weak scaling, and communication vs. computation operations for (iii) strong and (iv) weak scaling.

Scalability

Authors

How to cite pyDNMFk?

@misc{pyDNMFk,
  author = {Bhattarai, Manish and Nebgen, Ben and Skau, Erik and Eren, Maksim and Chennupati, Gopinath and Vangara, Raviteja and Djidjev, Hristo and Patchett, John and Ahrens, Jim and Alexandrov, Boian},
  title = {pyDNMFk: Python Distributed Non Negative Matrix Factorization},
  year = {2021},
  publisher = {GitHub},
  journal = {GitHub repository},
  doi = {10.5281/zenodo.4722448},
  howpublished = {\url{https://github.com/lanl/pyDNMFk}}
}


@article{vangara2021finding,
  title={Finding the Number of Latent Topics With Semantic Non-Negative Matrix Factorization},
  author={Vangara, Raviteja and Bhattarai, Manish and Skau, Erik and Chennupati, Gopinath and Djidjev, Hristo and Tierney, Tom and Smith, James P and Stanev, Valentin G and Alexandrov, Boian S},
  journal={IEEE Access},
  volume={9},
  pages={117217--117231},
  year={2021},
  publisher={IEEE}
}

@inproceedings{bhattarai2020distributed,
  title={Distributed Non-Negative Tensor Train Decomposition},
  author={Bhattarai, Manish and Chennupati, Gopinath and Skau, Erik and Vangara, Raviteja and Djidjev, Hristo and Alexandrov, Boian S},
  booktitle={2020 IEEE High Performance Extreme Computing Conference (HPEC)},
  pages={1--10},
  year={2020},
  organization={IEEE}
}
@inproceedings {s.20211055,
booktitle = {EuroVis 2021 - Short Papers},
editor = {Agus, Marco and Garth, Christoph and Kerren, Andreas},
title = {{Selection of Optimal Salient Time Steps by Non-negative Tucker Tensor Decomposition}},
author = {Pulido, Jesus and Patchett, John and Bhattarai, Manish and Alexandrov, Boian and Ahrens, James},
year = {2021},
publisher = {The Eurographics Association},
ISBN = {978-3-03868-143-4},
DOI = {10.2312/evs.20211055}
}
@article{chennupati2020distributed,
  title={Distributed non-negative matrix factorization with determination of the number of latent features},
  author={Chennupati, Gopinath and Vangara, Raviteja and Skau, Erik and Djidjev, Hristo and Alexandrov, Boian},
  journal={The Journal of Supercomputing},
  pages={1--31},
  year={2020},
  publisher={Springer}
}

Acknowledgments

Los Alamos National Lab (LANL), T-1

Copyright Notice

© (or copyright) 2020. Triad National Security, LLC. All rights reserved. This program was produced under U.S. Government contract 89233218CNA000001 for Los Alamos National Laboratory (LANL), which is operated by Triad National Security, LLC for the U.S. Department of Energy/National Nuclear Security Administration. All rights in the program are reserved by Triad National Security, LLC, and the U.S. Department of Energy/National Nuclear Security Administration. The Government is granted for itself and others acting on its behalf a nonexclusive, paid-up, irrevocable worldwide license in this material to reproduce, prepare derivative works, distribute copies to the public, perform publicly and display publicly, and to permit others to do so.

License

This program is open source under the BSD-3 License. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

  3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

pydnmfk's People

Contributors

ceodspspectrum, maksimekin

pydnmfk's Issues

Matrix input file format questions

First of all, thank you for making your software available as open source. I've installed this library for a research client, and they have their matrices stored as an anndata object in CSR format in an hdf5 file. From what I can tell, the functions in your data_io.py read matrices in csv, mat (MATLAB?), npy, and npz formats.
Is there a way, or a plan, to read anndata CSR from hdf5? Any recommendations for converting to npy (csv would be too large)?
And for the currently available formats, are they read as CSR sparse matrices? Can you preprocess the A matrix so that it is distributed across multiple files, or does A have to be read from a single file and then distributed?
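One possible way to approach the conversion asked about here, independent of pyDNMFk, is to densify the CSR matrix one row block at a time so that the full dense matrix never has to sit in memory. The sketch below works on the three raw CSR arrays (`data`, `indices`, `indptr`); reading them from the HDF5 file is left out, and `csr_rows_to_dense` is a hypothetical helper. It assumes the CSR data has no duplicate column entries within a row.

```python
import numpy as np

def csr_rows_to_dense(data, indices, indptr, n_cols, row_start, row_stop,
                      dtype=np.float32):
    """Densify rows [row_start, row_stop) of a CSR matrix into a 2-D array."""
    block = np.zeros((row_stop - row_start, n_cols), dtype=dtype)
    for out_row, row in enumerate(range(row_start, row_stop)):
        lo, hi = indptr[row], indptr[row + 1]  # span of this row's nonzeros
        block[out_row, indices[lo:hi]] = data[lo:hi]
    return block

# CSR form of [[1, 0, 2], [0, 0, 3], [4, 5, 0]].
data = np.array([1, 2, 3, 4, 5], dtype=np.float32)
indices = np.array([0, 2, 2, 0, 1])
indptr = np.array([0, 2, 3, 5])

block = csr_rows_to_dense(data, indices, indptr, n_cols=3, row_start=1, row_stop=3)
print(block)
```

Each block could then be written with `np.save`; whether pyDNMFk can consume per-block files directly is not established here.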

Mac issue

I've installed this on both a Mac laptop and desktop and it runs the swim problem just fine with mpirun using 4 processors. I then try to run my own problems with the command line

mpirun -n 4 python /Users/edward/pyDNMFk/main.py --p_r=4 --p_c=1 --process='pyDNMFk' --fpath='data/' --ftype='csv' --fname='HMX' --init='nnsvd' --itr=5000 --norm='kl' --method='mu' --results_path='results/34' --perturbations=100 --noise_var=0.015 --start_k=3 --end_k=4 --step_k=1 --sill_thr=.9 --sampling='uniform' --prune=true > log34.out &

And I get

/Users/edward/pyDNMFk/pyDNMFk/pyDNMF.py:238: RuntimeWarning: invalid value encountered in true_divide
col_err = np.sqrt(col_err_num / col_err_deno)

four times (once for each processor I requested). The output files are full of data, but the selection plot is messed up. Anything obvious?
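NumPy raises this particular warning when a division produces an undefined result such as 0/0, which then propagates as NaN. One possible cause, not confirmed for this dataset, is a column whose numerator and denominator are both zero. The sketch below reproduces the warning condition with made-up values for `col_err_num` and `col_err_deno` (the names come from the traceback above) and shows a guarded division.

```python
import numpy as np

# Hypothetical per-column values; the middle column is all zeros.
col_err_num = np.array([0.5, 0.0, 2.0])
col_err_deno = np.array([2.0, 0.0, 8.0])

# Naive version: 0/0 yields NaN (warning silenced here for the demo).
with np.errstate(invalid="ignore"):
    naive = np.sqrt(col_err_num / col_err_deno)

# Guarded version: divide only where the denominator is positive, else keep 0.
ratio = np.divide(col_err_num, col_err_deno,
                  out=np.zeros_like(col_err_num), where=col_err_deno > 0)
guarded = np.sqrt(ratio)

print(naive, guarded)
```

Checking the input for all-zero columns, or for NaN values after reading the csv file, would be a reasonable first diagnostic step.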
