A free, open-source inference and learning library for Sum-Product Networks (SPN)

License: BSD 3-Clause "New" or "Revised" License

Languages: Go 95.26%, Shell 4.06%, C 0.67%
Topics: spn, deep-learning, graph, pgm, statistics, probability, golang, image-reconstruction, natural-language-processing, classification

gospn's Introduction

GoSPN


My crude (and slightly terrifying) rendition of Renée French's Go Gopher writing what's on his mind.

GoSPN: A Sum-Product Network (SPN) Library

Overview

Sum-Product Networks (SPNs) are deep probabilistic graphical models (PGMs) that compactly represent tractable probability distributions. Exact inference in SPNs takes time linear in the number of edges, an attractive feature that distinguishes SPNs from other PGMs. Learning SPNs, however, is a hard task, and there have been many advances in learning their structure and parameters in the past few years. One interesting feature is that we can exploit the deep architecture of SPNs and perform deep learning on these models: adding hidden layers does not hurt the tractability of inference and increases the model's representational power, which makes further research on deep learning of SPNs highly desirable.

This project aims to provide a simple framework for Sum-Product Networks. Our objective is to provide inference tools and implement various learning algorithms present in the literature.
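As a toy illustration of why inference cost is linear in the number of edges, the sketch below evaluates a small hand-built SPN bottom-up; in a tree-shaped SPN every edge is visited exactly once. It is plain Go and does not use the GoSPN API.

package main

import "fmt"

// A toy SPN node: a sum over weighted children, a product of children, or an
// indicator leaf [X_v = val]. This does not use the GoSPN types.
type node struct {
	kind     string // "sum", "product" or "leaf"
	children []*node
	weights  []float64 // sum nodes only
	v, val   int       // leaves only: variable index and assigned value
}

// eval computes the SPN value bottom-up for a complete assignment x. In a
// tree-shaped SPN each edge is visited exactly once, hence the linear cost.
func eval(n *node, x []int) float64 {
	switch n.kind {
	case "leaf":
		if x[n.v] == n.val {
			return 1
		}
		return 0
	case "product":
		p := 1.0
		for _, c := range n.children {
			p *= eval(c, x)
		}
		return p
	default: // sum
		s := 0.0
		for i, c := range n.children {
			s += n.weights[i] * eval(c, x)
		}
		return s
	}
}

func main() {
	leaf := func(v, val int) *node { return &node{kind: "leaf", v: v, val: val} }
	// S = 0.7*[X0=1][X1=1] + 0.3*[X0=0][X1=0]
	S := &node{
		kind:    "sum",
		weights: []float64{0.7, 0.3},
		children: []*node{
			{kind: "product", children: []*node{leaf(0, 1), leaf(1, 1)}},
			{kind: "product", children: []*node{leaf(0, 0), leaf(1, 0)}},
		},
	}
	fmt.Println(eval(S, []int{1, 1})) // 0.7
	fmt.Println(eval(S, []int{1, 0})) // 0
}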

Roadmap

All

  • Unit tests
  • Support for continuous variables

Inference

  • Soft inference (marginal probabilities)
  • Hard inference (MAP) through max-product algorithm

Structure learning

  • Gens-Domingos learning schema (LearnSPN) [1]
  • Dennis-Ventura clustering structural learning algorithm [2]
  • Poon-Domingos dense architecture [3]

Weight learning

  • Computation of SPN derivatives
  • Soft generative gradient descent
  • Hard generative gradient descent
  • Soft discriminative gradient descent
  • Hard discriminative gradient descent

Input/Output

  • Support for .npy files
  • Support for .arff dataset format (discrete variables only)
  • Support for .csv dataset file format
  • Support for our own .data dataset format
  • Serialization of SPNs

References

  • [1] Learning the Structure of Sum-Product Networks, R. Gens & P. Domingos, ICML 2013
  • [2] Learning the Architecture of Sum-Product Networks Using Clustering on Variables, A. Dennis & D. Ventura, NIPS 25 (2012)
  • [3] Sum-Product Networks: A New Deep Architecture, H. Poon & P. Domingos, UAI 2011

Looking to contribute?

See the Contribution Guidelines.

Branches

  • dev contains the development version of GoSPN.
  • stable contains a stable version of GoSPN.
  • nlp contains the deprecated NLP model.

Usage

As a Go library

GoDocs: https://godoc.org/github.com/RenatoGeh/gospn

Learning algorithms are inside the github.com/RenatoGeh/gospn/learn package, with each algorithm as a subpackage of learn (e.g. learn/gens, learn/dennis, learn/poon).

To parse an ARFF format dataset and perform learning with the Gens-Domingos structure learning algorithm:

First import the relevant packages (e.g. learn/gens for Gens' structural learning algorithm, io for ParseArff and spn for inference methods):

import (
  "github.com/RenatoGeh/gospn/learn/gens"
  "github.com/RenatoGeh/gospn/io"
  "github.com/RenatoGeh/gospn/spn"
)

Extract contents from an ARFF file (for now only discrete variables):

name, scope, values, labels := io.ParseArff("filename.arff")

Send the relevant information to the learning algorithm:

S := gens.Learn(scope, values, -1, 0.0001, 4.0, 4)

S is the resulting SPN. We can now compute the marginal probabilities given a spn.VarSet:

evidence := make(spn.VarSet)
evidence[0] = 1 // Variable 0 = 1
// Summing out variable 1
evidence[2] = 0 // Variable 2 = 0
// Summing out all other variables.
p := S.Value(evidence)
// p is the marginal Pr(evidence), since S is already valid and normalized.

The method S.Value may repeat calculations if the SPN's graph is not a tree. To use dynamic programming and avoid recomputations, either use spn.Inference or spn.Storer:

// This only returns the desired probability (in logspace).
p := spn.Inference(S, evidence)

// A Storer stores values for all nodes.
T := spn.NewStorer()
t := T.NewTicket() // Creates a new DP table.
spn.StoreInference(S, evidence, t, T) // Stores inference values from each node to T(t).
p = T.Single(t, S) // Returns the first value inside node S: T(t, S).

Finding the approximate MPE works the same way. Given a VarSet evidence, the MPE is computed by:

args, mpe := S.ArgMax(evidence) // mpe is the probability and args is the argmax valuation.

Similarly to S.Value, S.ArgMax may recompute values if the graph is not a tree. Use StoreMAP if the graph is a general DAG instead.

_, args := spn.StoreMAP(S, evidence, t, T)
mpe := T.Single(t, S)
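Putting the calls above together, a minimal end-to-end sketch could look like the following (the file name and the gens.Learn parameters are the same placeholders used earlier):

package main

import (
	"fmt"

	"github.com/RenatoGeh/gospn/io"
	"github.com/RenatoGeh/gospn/learn/gens"
	"github.com/RenatoGeh/gospn/spn"
)

func main() {
	// Parse a discrete ARFF dataset and learn an SPN structure with LearnSPN.
	_, scope, values, _ := io.ParseArff("filename.arff")
	S := gens.Learn(scope, values, -1, 0.0001, 4.0, 4)

	// Query the marginal of variable 0 = 1, summing out all other variables.
	evidence := make(spn.VarSet)
	evidence[0] = 1

	// Memoized inference, safe for DAG-shaped SPNs.
	T := spn.NewStorer()
	t := T.NewTicket()
	spn.StoreInference(S, evidence, t, T)
	p := T.Single(t, S)
	fmt.Println("log Pr(evidence):", p)

	// Approximate MPE completion of the same evidence.
	args, mpe := S.ArgMax(evidence)
	fmt.Println("argmax:", args, "log-probability:", mpe)
}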

Dependencies

GoSPN is written in Go. Go is an open source language originally developed at Google. It's a simple yet powerful and fast language built with efficiency in mind. Installing Go is easy. Pre-compiled packages are available for FreeBSD, Linux, Mac OS X and Windows for both 32 and 64-bit processors. For more information see https://golang.org/doc/install.

GoNum

We have deprecated GNU GSL in favor of GoNum (https://github.com/gonum/). GoNum is written in Go, meaning when installing GoSPN, the Go package manager should automatically install all dependencies (including GoNum).

If this does not happen and an error like the following appears on screen:

cannot find package "[...]/gonum/stat" in any of

Enter the following commands:

go get -u gonum.org/v1/gonum/stat
go get -u gonum.org/v1/gonum/mathext

We have deprecated functions that made GoSPN independent of GoNum or GNU GSL, so we recommend installing GoNum.

NpyIO

GoSPN supports .npy NumPy array datasets. We use NpyIO to read the file and reformat it into GoSPN's dataset format. Go's go get should automatically install NpyIO.

graph-tool (optional)

Graph-tool is a Python module for graph manipulation and drawing. Since the SPNs we'll generate with most learning algorithms may have hundreds of thousands of nodes and hundreds of layers, we need a fast and efficient graph drawing tool for displaying our graphs. Since graph-tool uses C++ metaprogramming extensively, its performance is comparable to that of a C++ library.

Graph-tool uses the C++ Boost Library and can be compiled with OpenMP, a library for parallel programming on multi-core architectures that may decrease graph compilation time significantly.

Compiling graph-tool can take up to 80 minutes and 3GB of RAM. If you do not plan on compiling the graphs GoSPN outputs, it is highly recommended that you do not install graph-tool.

Subdependencies and installation instructions are listed at https://graph-tool.skewed.de/download.

Graphviz (optional)

GoSPN also supports graph drawing with Graphviz. See io/output.go.

Compiling and Running GoSPN

To get the source code through Go's go get command, run the following command:

$ go get -u github.com/RenatoGeh/gospn

Then ensure all dependencies are pulled:

$ cd gospn && go build

Updating GoSPN

To update GoSPN, run:

$ go get -u github.com/RenatoGeh/gospn

Datasets

For a list of all available datasets in .data format, see:

Results

Some benchmarking and experiments we did with GoSPN. More can be found at https://github.com/renatogeh/benchmarks.

Image classifications

Digits dataset correct classifications

Caltech dataset correct classifications

Image completions with prior face knowledge

Olivetti faces dataset C1 39 completions

Olivetti faces dataset C1 9 completions

Image completions without prior face knowledge

Olivetti faces dataset C2 39 completions

Olivetti faces dataset C2 9 completions

Literature

The following articles used GoSPN!

  • Credal Sum-Product Networks, D. Mauá, F. Cozman, D. Conaty & C. Campos, PMLR 2017
  • Approximation Complexity of Maximum A Posteriori Inference in Sum-Product Networks, D. Conaty, D. Mauá & C. Campos, UAI 2017

Acknowledgements

This project is part of my undergraduate research project supervised by Prof. Denis Deratani Mauá at the Institute of Mathematics and Statistics - University of São Paulo. We had financial support from CNPq grant #800585/2016-0.

We would like to greatly thank Diarmaid Conaty and Cassio P. de Campos, both from Queen's University Belfast, for finding and correcting several bugs.

gospn's People

Contributors

renatogeh

gospn's Issues

Implement Fisher's Exact Test

Fisher's exact test is a contingency-table variable independence test ideal for small datasets. GoSPN currently implements Pearson's chi-square test and the G-test, both of which provide approximations that are quite bad on smaller sample sizes. Since LearnSPN recursively splits the data at each step, potentially shrinking sample sizes at an exponential rate, an exact test is preferred.

There is, as of yet (and as far as I know), no implementation of the Fisher exact test for the general m×n contingency table in Go, C or C++. I have found implementations in R and Python, though I'm not sure the latter is correctly implemented (we should definitely implement unit tests here).

The test itself should be placed at utils/indep/fisher.go, and its unit test under utils/indep/fisher_test.go.
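For reference, a minimal sketch of the 2×2 special case follows (the issue asks for the general m×n test; the functions below are hypothetical and not part of GoSPN):

package indep

import "math"

// logFact returns log(n!) using the log-gamma function.
func logFact(n int) float64 {
	v, _ := math.Lgamma(float64(n) + 1)
	return v
}

// hyperLogProb is the log-probability of the 2x2 table [[a b] [c d]] under
// the hypergeometric distribution with the table's margins held fixed.
func hyperLogProb(a, b, c, d int) float64 {
	return logFact(a+b) + logFact(c+d) + logFact(a+c) + logFact(b+d) -
		logFact(a) - logFact(b) - logFact(c) - logFact(d) - logFact(a+b+c+d)
}

// FisherExact2x2 returns the two-sided p-value of Fisher's exact test: the
// total probability of all tables with the same margins that are no more
// likely than the observed one. (Hypothetical helper, 2x2 case only.)
func FisherExact2x2(a, b, c, d int) float64 {
	r1, c1, n := a+b, a+c, a+b+c+d
	obs := hyperLogProb(a, b, c, d)
	lo, hi := 0, r1
	if r1+c1-n > lo {
		lo = r1 + c1 - n
	}
	if c1 < hi {
		hi = c1
	}
	p, eps := 0.0, 1e-12
	// Enumerate every table with the same margins by shifting the (0,0) cell.
	for x := lo; x <= hi; x++ {
		if lp := hyperLogProb(x, r1-x, c1-x, n-r1-c1+x); lp <= obs+eps {
			p += math.Exp(lp)
		}
	}
	return p
}

The general m×n case additionally requires enumerating all tables with the given margins, which is what makes an efficient implementation harder.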

TensorFlow integration

We seek to integrate TensorFlow with the goal of implementing RAT-SPNs as proposed in [Peharz et al 2018]. However, as of yet, only Python is covered by the API stability guarantees, meaning it is the only language with an API for building models. Having said that, the Go API is still capable of training models and running inference.

This means that if we were to add TF integration, we would have to add a Python layer for model creation. Once the model has been created, the user would then be able to export it to GoSPN for learning, running inference or whatever else GoSPN may offer.

This is obviously not ideal, but until Google provides a full API for C++, C or Go, this is the best we can do. Suggestions on how to deal with this are welcome.

References

[Peharz et al 2018] - https://arxiv.org/pdf/1806.01910.pdf

Add CSV dataset support

Description: GoSPN only supports its custom .data format and the .arff dataset format. We plan to add .csv support.

File: io/csv.go

Implement DeriveWeights

Description: DeriveWeights should compute the derivative dS/dW, where W is the multiset of weights in S. It should allow the user to choose a data structure to perform the graph search (e.g. DFS with a stack, BFS with a queue). It should also use a Storer as its DP table.

File: learn/derive.go
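For intuition, the derivatives in question follow the usual SPN backpropagation rules: with dS/dS = 1 at the root, a sum node n gives each child c the term dS/dS_c += w_c * dS/dS_n and each weight dS/dw_c = S_c * dS/dS_n, while a product node gives each child dS/dS_c += dS/dS_n * (product of the siblings' values). The sketch below illustrates this on ad-hoc node structs, not the GoSPN types, and assumes a tree-shaped SPN; a DAG needs a topological traversal (the stack/queue choice mentioned above) and a Storer-style table.

package main

import "fmt"

// Toy sum/product nodes, not the GoSPN types. value caches the bottom-up pass
// (the role Storer plays in GoSPN); dNode and dWeight hold dS/dS_i and dS/dw_i.
type tnode struct {
	sum      bool // true: sum node; false: product node (or leaf if no children)
	children []*tnode
	weights  []float64 // sum nodes only
	leaf     float64   // leaves only: the node value under the given evidence
	value    float64
	dNode    float64
	dWeight  []float64
}

// forward computes every node value bottom-up.
func forward(n *tnode) float64 {
	if len(n.children) == 0 {
		n.value = n.leaf
		return n.value
	}
	if n.sum {
		n.value = 0
		for i, c := range n.children {
			n.value += n.weights[i] * forward(c)
		}
	} else {
		n.value = 1
		for _, c := range n.children {
			n.value *= forward(c)
		}
	}
	return n.value
}

// backward propagates dS/dS_i top-down and fills dS/dw_i at sum nodes.
// This version assumes a tree; a DAG needs a topological traversal instead.
func backward(n *tnode) {
	if n.sum {
		n.dWeight = make([]float64, len(n.children))
		for i, c := range n.children {
			n.dWeight[i] = c.value * n.dNode
			c.dNode += n.weights[i] * n.dNode
		}
	} else {
		for i, c := range n.children {
			d := n.dNode
			for j, s := range n.children {
				if j != i {
					d *= s.value
				}
			}
			c.dNode += d
		}
	}
	for _, c := range n.children {
		backward(c)
	}
}

func main() {
	a, b := &tnode{leaf: 0.8}, &tnode{leaf: 0.3}
	prod := &tnode{children: []*tnode{a, b}}
	root := &tnode{sum: true, weights: []float64{1.0}, children: []*tnode{prod}}
	forward(root)
	root.dNode = 1 // dS/dS = 1 at the root
	backward(root)
	fmt.Println(root.dWeight[0], a.dNode) // 0.24 0.3
}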

Implement hard and soft EM clustering for the Gens-Domingos algorithm.

Description: Our Gens implementation currently clusters instances with DBSCAN or, alternatively, k-means. We wish to add EM clustering as cited in [1]. Additionally, we intend to implement both hard and soft EM, as explained in the following quote from [1]:

For soft EM, where instances can be fractionally assigned to clusters, T needs to be extended with a weight for each instance, and each instance is passed to each cluster it has nonzero weight in. However, this is considerably less efficient than hard EM, where each instance is wholly assigned to its most probable cluster, and we will use the latter method.

DBSCAN and k-means behave like hard EM, assigning instances to the most probable cluster.

Files: learn/gens.go, utils/cluster/em.go

References:

  • [1] Learning the Structure of Sum-Product Networks, R. Gens & P. Domingos, ICML 2013
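To make the hard/soft distinction concrete, here is a minimal sketch of the two assignment rules for a one-dimensional Gaussian mixture E-step (not the GoSPN implementation; function names are ad hoc):

package cluster

import "math"

// gaussPDF is the density of a univariate Gaussian.
func gaussPDF(x, mu, sigma float64) float64 {
	d := (x - mu) / sigma
	return math.Exp(-0.5*d*d) / (sigma * math.Sqrt(2*math.Pi))
}

// SoftAssign returns, for one instance x, the fractional cluster weights
// r_k ∝ w_k * N(x | mu_k, sigma_k) used by soft EM: the instance is passed to
// every cluster in which it has nonzero weight. (Assumes the total is > 0.)
func SoftAssign(x float64, w, mu, sigma []float64) []float64 {
	r := make([]float64, len(w))
	var z float64
	for k := range w {
		r[k] = w[k] * gaussPDF(x, mu[k], sigma[k])
		z += r[k]
	}
	for k := range r {
		r[k] /= z
	}
	return r
}

// HardAssign returns the single most probable cluster, which is how hard EM
// (and, in spirit, k-means and DBSCAN) assigns each instance.
func HardAssign(x float64, w, mu, sigma []float64) int {
	r := SoftAssign(x, w, mu, sigma)
	best := 0
	for k := range r {
		if r[k] > r[best] {
			best = k
		}
	}
	return best
}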

got NaN value

hi @RenatoGeh

I tried running go main.go

I got the following error (sample):
...
Creating new leaf...
Sample size: 321, scope size: 1
Creating new leaf...
Sample size: 321, scope size: 1
Creating new leaf...
Testing instance 0. Should be classified as 8.
Pr(X=0|E) = antilog(NaN) = NaN
Pr(X=1|E) = antilog(NaN) = NaN
Pr(X=2|E) = antilog(NaN) = NaN
Pr(X=3|E) = antilog(NaN) = NaN
Pr(X=4|E) = antilog(NaN) = NaN
Pr(X=5|E) = antilog(NaN) = NaN
Pr(X=6|E) = antilog(NaN) = NaN
Pr(X=7|E) = antilog(NaN) = NaN
Pr(X=8|E) = antilog(NaN) = NaN
Pr(X=9|E) = antilog(NaN) = NaN
Pr(X=10|E) = antilog(NaN) = NaN
Pr(X=11|E) = antilog(NaN) = NaN
Pr(X=12|E) = antilog(NaN) = NaN
Pr(X=13|E) = antilog(NaN) = NaN
Pr(X=14|E) = antilog(NaN) = NaN
...

Is it OK to get NaN values?

Thanks,
hcr

Add a GaussianMixture node

Description: A mixture of k Gaussians can be represented as a sum node with k Gaussian leaves as children. Learning the weights would then be done with EM.

File: spn/gausmix.go
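A plain-Go illustration of the idea follows; the real node would be built from GoSPN's sum and Gaussian node types in spn/, so the struct and method below are only ad hoc stand-ins:

package spn

import "math"

// GaussianMixture illustrates the proposed node: semantically a sum node whose
// k children are Gaussian leaves. Field and method names here are ad hoc.
type GaussianMixture struct {
	W, Mu, Sigma []float64 // k mixture weights, means and standard deviations
}

// Value returns sum_k W[k] * N(x | Mu[k], Sigma[k]).
func (g *GaussianMixture) Value(x float64) float64 {
	var s float64
	for k := range g.W {
		d := (x - g.Mu[k]) / g.Sigma[k]
		s += g.W[k] * math.Exp(-0.5*d*d) / (g.Sigma[k] * math.Sqrt(2*math.Pi))
	}
	return s
}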

Improve digits dataset documentation

Description: Our custom digits dataset has no proper documentation. We should add more information regarding the dataset, as shown in [1]. Another idea is to move the dataset to a completely separate repository and reference it in the GoSPN docs. In that case, we should also remove all other datasets (i.e. caltech and olivetti) from this repository.

File: data/digits/README.md

References:

Why use DBSCAN?

Your work is really awesome!

EM is used in LearnSPN, but DBSCAN is used in this project. I wonder why. Does it have better performance?

Thank you.

Add continuous variables support

Description: Add continuous variable support for both dataset parsing and Gens-Domingos.

Files: io/arff.go, io/csv.go, learn/gens.go

Should the project have its own website?

I believe that the project should have its own website, preferably a simple one built with github pages. I think it's a good way to register activities, improvements and research on the field.

I would be happy to make this happen if approved :)

Implement OPTICS

OPTICS (Ordering Points To Identify the Clustering Structure) is a clustering algorithm similar to DBSCAN. DBSCAN's major weakness is density tuning. OPTICS attempts to address this issue by ordering points and choosing the best epsilon.

We currently have an incomplete OPTICS implementation at utils/cluster/optics.go. LearnSPN relies heavily on both clustering and variable independence, and having OPTICS should increase its performance.

This isn't a priority though, as plenty of other, more interesting structure learning algorithms have sprung up recently.

Implement DeriveSPN

Description: DeriveSPN should compute the derivative dS/dS_i of each node in an SPN. It should allow the user to choose a data structure to perform the graph search (e.g. DFS with a stack, BFS with a queue). It should also use a Storer as its DP table.

File: learn/derive.go

Fix Gens-Domingos complexity analysis

Description: My Gens-Domingos complexity analysis is wrong. It currently only accounts for the current iteration of clustering and independence tests, omitting the recursive calls to each child it creates.

File: doc/analysis/analysis.tex

Test coverage

GoSPN currently has little to no test coverage of its modules. Go has a native unit testing library, which we should use.

Though this is marked as a good first issue, we obviously do not expect a single PR with all the changes. Instead, please send PRs addressing one checkbox at a time.
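As a starting point, a test could take the shape below, using Go's native testing package; the function under test (ApproxEqual from common/equal.go) and its exact signature are assumptions here:

package common

import "testing"

// TestApproxEqual is a table-driven test in the style encouraged by Go's
// native testing package. ApproxEqual's real signature in common/equal.go may
// differ; this is only a sketch of the structure.
func TestApproxEqual(t *testing.T) {
	cases := []struct {
		a, b float64
		want bool
	}{
		{1.0, 1.0, true},
		{1.0, 1.0 + 1e-12, true},
		{1.0, 1.1, false},
	}
	for _, c := range cases {
		if got := ApproxEqual(c.a, c.b); got != c.want {
			t.Errorf("ApproxEqual(%v, %v) = %v, want %v", c.a, c.b, got, c.want)
		}
	}
}

Each checkbox below would get a similar _test.go file next to the code it covers, so go test ./... runs the whole suite.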

The following is our goal coverage list:

  • app module:
    • Classification: should achieve at least a certain percentage of accuracy in an artificial dataset (e.g. DigitsX) (app/image.go)
    • Completion: should not exceed a certain mean squared error threshold relative to the original image's crop (app/image.go)
  • common module:
    • HSV to RGB (common/color.go)
    • ApproxEqual (common/equal.go)
    • Queue tests (common/queue.go)
      • Enqueue
      • Dequeue
      • Stress test
    • Stack tests (common/stack.go)
      • Enqueue
      • Dequeue
      • Stress test
  • conc module:
    • SingleQueue tests on race condition (conc/queue.go)
  • data module:
    • Dataset manipulation (data/manipulate.go)
      • Cascade Rounding
      • ExtractLabels
      • Partition
      • PartitionByLabels
      • Split
      • Copy
      • Shuffle
      • Join
      • SubtractLabel
      • SubtractVariable
      • Identical
      • MergeLabel
      • Divide
    • Dataset download test
  • io module:
    • ARFF dataset (io/arff.go)
      • ARFFToData
      • ParseArff
    • HTTP (io/http.go)
      • DownloadFromURL
    • Image I/O (io/images.go)
      • SplitHalf
    • DATA dataset (io/input.go)
      • ParseData
      • ParseDataNL
      • ParseEvidence
      • ParsePartitionedData
    • NPY dataset (io/npy.go)
      • ReadBalanced
      • ReadAll
      • Read
      • Reset
    • Output (io/output.go)
      • DrawGraphTools
      • DrawGraph
    • SPN I/O (io/spn.go)
      • SaveSPN
      • LoadSPN
  • learn module:
    • SPN derivation (learn/derive.go)
      • DeriveSPN
      • DeriveWeights
      • DeriveWeightsBatch
      • DeriveApplyWeights
      • DeriveHard
    • Discriminative gradient descent (learn/discriminative.go)
      • DiscriminativeGD
      • DiscriminativeHardGD
      • DiscriminativeBGD
      • DiscriminativeHardBGD
      • applyDGD
      • storeDGD
      • applyDGDFrom
      • applyHDGD
    • Generative gradient descent (learn/generative.go)
      • GenerativeGD
      • GenerativeHardGD
      • GenerativeBGD
      • GenerativeHardBGD
      • applyFastGD
      • applyGD
      • applyFastHGD
      • applyHGD
    • Variable and scope (learn/variable.go)
      • GobEncode
      • GobDecode
      • ExtractInstance
      • CompleteDataToMatrix
      • DataToMatrix
      • MatrixToData
      • ReflectScope
      • CopyScope
      • DataToVarData
    • Dennis-Ventura clustering structure learning
    • Gens-Domingos LearnSPN structure learning
    • Poon-Domingos dense architecture
  • score module:
    • Confusion matrix
    • Score registration
  • spn module:
    • Gaussian node (spn/gaussian.go)
    • Product node (spn/product.go)
    • Sum node (spn/sum.go)
    • Indicator node (spn/indicator.go)
    • Multinomial node (spn/multinom.go)
    • Breadth and depth first search (spn/search.go)
    • SPN serialization (spn/serial.go)
    • Dynamic programming storing (spn/storer.go)
    • Topological sorting (spn/topo.go)
    • Exact inference bottom-up pass (spn/utils.go)
      • Inference
      • InferenceY
      • StoreInference
    • Max-product algorithm (spn/utils.go)
      • StoreMAP
      • TraceMAP
    • NormalizeSPN (spn/utils.go)
    • ComputeHeight (spn/utils.go)
    • ComputeScope (spn/utils.go)
    • Complete (spn/utils.go)
    • Decomposable (spn/utils.go)
  • sys module:
    • RandComb (sys/rand.go)
  • utils module:
    • Logarithmic operations (utils/log.go)
      • LogSumLog
      • LogSum
      • LogProd
      • LogSumPair
      • Trim
      • LogSumExp
      • LogSumExpPair
    • Statistic functions (utils/stats.go)
      • Mean
      • StdDev
      • MuSigma
      • PartitionQuantiles
    • UnionFind (utils/unionfind.go)
    • Clustering algorithms
      • Auxiliary clustering functions (utils/cluster.go)
      • DBSCAN
      • K-means
      • K-medoid
      • K-mode
      • OPTICS
    • Variable independence tests
      • Chi-square Pearson test
      • G-test
      • Independence graph
