Giter Site home page Giter Site logo

py-sgtl's Introduction

I’m a Lecturer in the School of Computer Science at the University of St Andrews. My research interests are broadly in algorithms for data science and machine learning and I always aim to develop theoretically sound algorithms which work well in practice. Please see my website for more information on my work.

If you'd like help with any of my work, or you'd just like to connect, then please do get in touch using the contact details on my website.

py-sgtl's People

Contributors

pmacg avatar yonmaor avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar

py-sgtl's Issues

Generating graphs from the SBM is slow

The methods for generating graphs from the stochastic block model are a little slow. It would be nice to improve the implementation to speed them up.

Add a method for computing the number of connected components in a graph

It would be useful to have a single method for computing the number of connected components in a graph. Possibly by finding the number of eigenvalues of the Laplacian equal to 0 - might not be fast though!

Probably better to use a 'proper' connected components algorithm :) It's basically just repeated graph traversals.

Graph methods should give meaningful errors when vertices out of range

Currently, if you ask for the volume of a set of vertices, but the set includes indices which are greater than the number of vertices in the graph, the code throws an IndexError from somewhere inside the graph code.

This is not totally unreasonable, but it would help debugging this kind of error if it gave a clear error message. Something like: "Vertex set includes indices larger than number of vertices".

Add spectrum method to graph

Add the methods

  • adjacency_spectrum()
  • laplacian_spectrum()
  • normalised_laplacian_spectrum()

To the graph object for returning the eigenvalues of the corresponding graph matrices.

Create labelled graph object

Add an object which extends the graph class to include labels for each vertex. Consider allowing arbitrary data associated with each vertex - maybe a dictionary of key-value pairs.

Look into behaviour of methods on graphs with negative weights

It may be sometimes useful to use graphs with negative weights. The library is not currently designed with this in mind, but many things might 'just work'. This issue covers investigating formally the behaviour of the library on graphs with negative weights.

Graph matrices are not symmetric for an undirected graph

For a large undirected graph generated from the stochastic block model, the generated graph matrices are not always symmetric. For example:

>>> big_graph = sgtl.random.ssbm(1000, 5, 0.8, 0.2)
>>> lap_mat = big_graph.normalised_laplacian_matrix()
>>> lap_mat_dense = lap_mat.toarray()
>>> np.allclose(lap_mat_dense, lap_mat_dense.T)
False

I haven't debugged this yet, but we should certainly be able to assume that the graph matrices are symmetric for an undirected graph.

Typo in return value of bipartiteness method

The return value of the Graph.bipartiteness() method should display a beta in math mode in the rendered documentation. See the return value of the conductance method for how to do this.

Construct tensor-product graph

Add a method for constructing the tensor product of two graphs.

Consider overloading the '*' operator, so you can write graph3 = graph1 * graph2.

Add implementation of the cheeger cut

This issue covers moving the implementation of the Cheeger cut and the Trevisan cut from https://github.com/pmacg/pycheeg into SGTL.

Should include a fix for the issue raised under that project.

Combine graphs by adding their edges

Add a method to add two graphs together, equivalent to adding their adjacency matrices.

If the graphs do not have the same number of vertices, then this method should throw a suitable exception.

Consider overloading the '+' operator so that you can write graph3 = graph1 + graph2.

Benchmarking and optimisation of basic graph operations

Many methods in the library are much slower than we'd like them to be. This issue covers doing some benchmarking and testing to identify the bottlenecks in the basic graph operations and implementing some speed-ups.

An ideal goal would be for basic operations (such as computing the conductance of a vertex set) on a graph with 1m vertices to be fast enough to be responsive to a user (e.g. sub 1 second) - is this achievable?

Generating SBM with different cluster sizes can fail

The following code for generating a graph from the stochastic block model fails with out-of-bounds errors.

import sgtl.random
import numpy as np


def main():
    cluster_sizes = [100, 50, 20, 20, 20]
    p = 0.8
    prob_mat_q = np.asarray([[p,   0.2, 0.5, 0.1, 0.1],
                             [0.2, p,   0.1, 0.6, 0.2],
                             [0.5, 0.1, p,   0.2, 0.1],
                             [0.1, 0.6, 0.2, p,   0.5],
                             [0.1, 0.2, 0.1, 0.5, p]])
    graph = sgtl.random.sbm(cluster_sizes, prob_mat_q)


if __name__ == '__main__':
    main()
Traceback (most recent call last):
  File ".../main.py", line 17, in <module>
    main()
  File ".../main.py", line 13, in main
    graph = sgtl.random.sbm(cluster_sizes, prob_mat_q)
  File ".../venv/lib/python3.8/site-packages/sgtl/random.py", line 179, in sbm
    adj_mat[vertex_1, vertex_2] = 1
  File ".../venv/lib/python3.8/site-packages/scipy/sparse/_lil.py", line 330, in __setitem__
    return self._set_intXint(key[0], key[1], x)
  File ".../venv/lib/python3.8/site-packages/scipy/sparse/_lil.py", line 299, in _set_intXint
    _csparsetools.lil_insert(self.shape[0], self.shape[1], self.rows,
  File "_csparsetools.pyx", line 61, in scipy.sparse._csparsetools.lil_insert
  File "_csparsetools.pyx", line 87, in scipy.sparse._csparsetools.lil_insert
IndexError: column index (240) out of bounds

Generating graphs from the SBM with a sparse array as the probability matrix fails.

Minimal working example

The error message thrown depends on the precise sparse matrix format. For example, here are two slightly different behaviours.

Case 1

>>> import sgtl.random
>>> import scipy.sparse
>>> prob_mat = scipy.sparse.eye(5)
>>> graph = sgtl.random.sbm_equal_clusters(100, 5, prob_mat)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/peter/.local/lib/python3.8/site-packages/sgtl/random.py", line 204, in sbm_equal_clusters
    return sbm([int(n/k)] * k, prob_mat_q, directed=directed)
  File "/home/peter/.local/lib/python3.8/site-packages/sgtl/random.py", line 176, in sbm
    for (vertex_1, vertex_2) in _generate_sbm_edges(cluster_sizes, prob_mat_q, directed=directed):
  File "/home/peter/.local/lib/python3.8/site-packages/sgtl/random.py", line 108, in _generate_sbm_edges
    prob_mat_q[cluster_1_idx][cluster_2_idx])
TypeError: 'dia_matrix' object is not subscriptable

Case 2

>>> import sgtl.random
>>> import scipy.sparse
>>> prob_mat = scipy.sparse.eye(5).tocsr()
>>> graph = sgtl.random.sbm_equal_clusters(100, 5, prob_mat)
TypeError: float() argument must be a string or a number, not 'csr_matrix'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/peter/.local/lib/python3.8/site-packages/sgtl/random.py", line 204, in sbm_equal_clusters
    return sbm([int(n/k)] * k, prob_mat_q, directed=directed)
  File "/home/peter/.local/lib/python3.8/site-packages/sgtl/random.py", line 176, in sbm
    for (vertex_1, vertex_2) in _generate_sbm_edges(cluster_sizes, prob_mat_q, directed=directed):
  File "/home/peter/.local/lib/python3.8/site-packages/sgtl/random.py", line 107, in _generate_sbm_edges
    num_edges = np.random.binomial(cluster_1_size - (vertex_2 - c1_base_index),
  File "mtrand.pyx", line 3362, in numpy.random.mtrand.RandomState.binomial
ValueError: setting an array element with a sequence.

Workaround

It is possible to work around this problem in the client code by calling the toarray() method on the probability matrix when passing it to the SBM code.

>>> import sgtl.random
>>> import scipy.sparse
>>> prob_mat = scipy.sparse.eye(5).tocsr()
>>> graph = sgtl.random.sbm_equal_clusters(100, 5, prob_mat.toarray())

The num_edges member of the Graph object is incorrect when graph is weighted

Given a weighted graph object, the graph.num_edges gives half of the total volume of the graph, rather than the actual number of weighted edges.

This should be replaced with two methods on the object, a num_edges method which gives the number of non-zero elements of the adjacency matrix (corrected for the symmetry) and a graph_volume method which returns the total weight of all the edges in the graph.

Plot spectral embedding

Add a method for drawing a graph, using its spectral embedding. May just call out to e.g. networkx code to do the drawing.

Check graph methods are correct for directed graphs

The sgtl.graph.Graph object has been written with undirected graphs in mind, but is expected to work for directed graphs too.

This issue covers double checking the directed graph cases. In particular, it covers:

  • writing tests for each Graph method for the directed graph cases
  • fixing up any problems that are found and
  • updating the documentation to make it clear what the behaviour is for directed graphs

Conductance of the empty set should return clear error message

Requesting the conductance (or bipartiteness) of an empty set should return a meaningful error message to the client.

Minimal working example

>>> import sgtl.graph
>>> graph = sgtl.graph.cycle_graph(10)
>>> graph.conductance([])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/peter/wc/py-sgtl/sgtl/graph.py", line 144, in conductance
    return 1 - (2 * self.weight(vertex_set_s, vertex_set_s, sets_are_equal=True)) / self.volume(vertex_set_s)
ZeroDivisionError: division by zero

For the bipartiteness case:

>>> graph.bipartiteness([], [])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/peter/wc/py-sgtl/sgtl/graph.py", line 159, in bipartiteness
    return 1 - 2 * self.weight(vertex_set_l, vertex_set_r) / self.volume(vertex_set_l + vertex_set_r)
ZeroDivisionError: division by zero

Desired Behaviour

Something like:

>>> import sgtl.graph
>>> graph = sgtl.graph.cycle_graph(10)
>>> graph.conductance([])
<snip>
ValueError: conductance of the empty set is undefined

Convert between networkx and graph objects

Add two methods to the sgtl.graph.Graph object to convert to and from the networkx graph type:

@staticmethod
def from_networkx(nx_graph):
    pass

def to_networkx(self):
    pass

Update to work with sparse arrays

The scipy sparse linear algebra library has recently introduced sparse arrays rather than matrices and is encouraging their use over the matrix types.

This enhancement covers updating SGTL to allow use of either the sparse array or sparse matrix types.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.