
dmlc / dgl


Python package built to ease deep learning on graphs, on top of existing DL frameworks.

Home Page: http://dgl.ai

License: Apache License 2.0

Python 62.49% Shell 0.27% CMake 0.59% C++ 22.64% C 0.27% Batchfile 0.05% Cuda 6.28% Cython 0.21% Jupyter Notebook 7.19%
deep-learning graph-neural-networks

dgl's Introduction

Distributed Machine Learning Common Codebase


DMLC-Core is the backbone library supporting all DMLC projects; it offers the bricks to build efficient and scalable distributed machine learning libraries.

Developer Channel: join the chat at https://gitter.im/dmlc/dmlc-core


Known Issues

  • RecordIO format is not portable across machines with different endianness. It is not possible to save a RecordIO file on an x86 machine and then load it on a SPARC machine, because x86 is little-endian while SPARC is big-endian (illustrated below).
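
To illustrate the byte-order issue, here is a small Python sketch (unrelated to the actual RecordIO code):

import struct

# The same 32-bit value laid out in little-endian (x86) vs big-endian
# (SPARC) byte order; records written in one native order cannot be
# read back verbatim in the other.
n = 0x12345678
print(struct.pack('<I', n).hex())  # '78563412'
print(struct.pack('>I', n).hex())  # '12345678'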

Contributing

Contributions to dmlc-core are welcome! dmlc-core follows Google's C++ style guide. If you are interested in contributing, take a look at the feature wishlist and open a new issue if you would like to add something.

  • DMLC-Core uses the C++11 standard. Ensure that your C++ compiler supports C++11.
  • Try to introduce minimal dependencies when possible.

Checklist before submitting code

  • Type make lint and fix all the style problems.
  • Type make doc and fix all the warnings.

NOTE

Dependencies:

libcurl4-openssl-dev

dgl's People

Contributors

aksnzhy, barclayii, chang-l, classicsong, czkkkkkk, drivanov, frozenbugs, gaiyu0, hetong007, huxiangkun, isratnisa, jermainewang, john-andrilla, keli-wen, kylasa, lingfanyu, mfbalin, mufeili, nv-dlasalle, peizhou001, ramonzhou, rhett-ying, skeleton003, vovallen, xiangyuzhi, yaox12, yxy235, yzh119, zheng-da, zzhang-cn


dgl's Issues

[BUG] Consecutive sends without recv

Here's the test I was running on the cpp branch.

import torch as th
from dgl import DGLGraph

def test_send_twice():
    g = DGLGraph()
    g.add_nodes(2)
    g.add_edge(0, 1)
    def _message_a(src, edge):
        return {'a': src['a']}
    def _message_b(src, edge):
        return {'b': src['b']}
    def _reduce(node, msgs):
        return {'a': msgs['a'].sum(1), 'b': msgs['b'].sum(1)}

    old_a = th.randn(2, 5)
    old_b = th.randn(2, 5)
    g.set_n_repr({'a': old_a, 'b': old_b})

    g.send(0, 1, _message_a)
    g.send(0, 1, _message_b)
    g.recv([1], _reduce)

    new_repr = g.get_n_repr()

    assert th.allclose(new_repr['a'][1], old_a[0])
    assert th.allclose(new_repr['b'][1], old_b[0])

First, if I do two consecutive sends with different message keys, DGLGraph throws a KeyError: 'b' regardless of whether we are sending on the same node pairs/edges.

The next issue is that we are adding edges into a "message graph" during sends. This does not make a lot of sense on multigraphs: if we allow sending between the same node pair twice without recv, we are left with two edges in the message graph. Indeed, in the test case above, I stepped into the _batch_recv method and found that the degree is 2 in the degree-bucketing procedure.

So do we want to allow (1) consecutive sends without recv, and (2) consecutive sends between the same node pairs without recv?

If we were to allow both, I guess we would ultimately want the message graph represented as an edge subgraph, or to somehow maintain a set of "already-added edges" in the message graph.

Thoughts?

Two issues in set_e_repr()

The following code fails:

import networkx as nx
from dgl import DGLGraph
import torch

G = nx.DiGraph()
G.add_nodes_from([0, 1, 2])
G.add_edges_from([(0, 1), (0, 2)])

erepr = torch.randn(2, 20)

DG = DGLGraph(G)
DG.set_e_repr({'features': erepr}, [0, 0], [1, 2])

There are two causes:
(1) Converting a DiGraph to a DGLGraph does not initialize the edge list appropriately. An easy fix would be passing graph_data into _init_state() and setting self._edge_list = list(graph_data.edges), although I'm not sure what the order of edges should be (I don't think it matters).
(2) In set_e_repr():

        if u_is_all:
            # ...
        else:
            eid = self.cached_graph.get_edge_id(u, v)
            if isinstance(h_uv, dict):
                for key, val in h_uv.items():
                    self._edge_frame[key] = F.scatter_row(self._edge_frame[key], eid, val)
            else:
                self._edge_frame[__REPR__] = F.scatter_row(self._edge_frame[__REPR__], eid, h_uv)

Setting a new field with given u and v refers to a non-existent column in the edge frame.
This leads to the question: do we want to allow partially setting edge/node frames before initialization? It is not even a "partial" set in the example above; the user is merely specifying the edge order themselves.
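
A sketch of fix (1), assuming _init_state receives the networkx graph (Frame and the attribute names are taken from the description above; this is hypothetical, not the actual DGL code):

# Hypothetical fix: seed the edge list from the networkx graph so that
# set_e_repr() can resolve (u, v) pairs to edge ids.
def _init_state(self, graph_data=None):
    self._node_frame = Frame()
    self._edge_frame = Frame()
    # Any consistent order works, as long as edge ids follow it.
    self._edge_list = list(graph_data.edges) if graph_data is not None else []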

State sharing between graphs

Let us discuss how to enable state sharing between graphs.

This page describes four kinds of semantics for graph duplication.

In the application of CDGNN (and many others), we are primarily interested in the "independent shallow" kind of semantics, which ensures that

g.nodes[node]['x'] is g.copy().nodes[node]['x']

The challenge is that we have to ensure this not only after duplication, but also after updating shared representations in either g or g.copy() (persistent sharing).

We can override copy, add_nodes_from and add_edges_from of networkx to let users tell us which representations to share. For example, after executing

h = g.copy(share=['x'])

the 'x' attribute will be shared between g and h, even if g or h is updated. We also have to consider the implications of adding/deleting nodes/edges/attributes. The following is my suggestion:

  • Adding a node/edge/attribute: to be consistent with the "independent shallow" semantics of networkx, this should only affect the graph the user adds the node/edge/attribute to.
  • Deleting a node/edge/attribute: I am not sure whether this is necessary. Maybe it suffices to override nx.subgraph the way we override copy.

One possible implementation is to maintain a tensor containing the shared representations and update that tensor in place whenever a representation is updated. There are at least two problems with this:

  1. In-place update of a tensor breaks autograd.

  2. update_func may change the shape of a representation, in which case we must create a new tensor for the shared representations. After creating a new tensor, we have to update every graph attribute that refers to the previous tensor, and extra bookkeeping is necessary to know which attributes to update.

However, a similar solution may work. Pseudo-code:

class DGLGraph(nx.DiGraph):
  # ...
  def copy(self, as_view=False, share=[]):
    # For simplicity, assume that only node attributes may be shared.
    # The following line works if we override d.__getitem__, see below.
    for n, d in self.nodes.items():
      self.storage[n].update({x : d[x] for x in share})
    for d in self.nodes.values():
      for x in share:
        d[x] = __SHARED__
    g = super().copy(as_view)
    g.storage = self.storage
    return g

We override d.__getitem__ to look up self.storage when an attribute equals __SHARED__; d.__setitem__ is overridden similarly.
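
A minimal sketch of such a dict, assuming a module-level __SHARED__ sentinel and a storage dict of shared tensors (all names here are illustrative, not the actual implementation):

__SHARED__ = object()  # sentinel marking an attribute as shared

class SharedDict(dict):
    """Per-node attribute dict that defers shared attributes to a
    common storage, so updates are visible to every sharing graph."""
    def __init__(self, storage, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._storage = storage  # shared values, keyed by attribute name

    def __getitem__(self, key):
        val = super().__getitem__(key)
        return self._storage[key] if val is __SHARED__ else val

    def __setitem__(self, key, val):
        if super().get(key) is __SHARED__:
            self._storage[key] = val  # update the shared copy in place
        else:
            super().__setitem__(key, val)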

Sending and receiving multiple fields (batched)

There are several instances where messages and node reductions involve more than node['REPR'], for example:

  1. FastGCN has a node weight vector q, where q_i is applied to the message v_i -> v_j.
  2. Control-variate sampling has a 'previous' representation used in the node reduction, e.g. sum(h[i], h_prev[i]).

Presumably the best way to handle this is to pass optional fields to send_and_recv and the related functions, which would batch a dictionary of tensors and fall back to batching node['REPR'] when given None.

Does this seem like a good strategy? I need to address this for 1 and 2.

set_e_repr(): behavior description when u and v equal ALL

It seems that for now set_e_repr() directly assigns the batched edge tensor storage when u == v == ALL. But how do I construct such a tensor storage? In particular, when u == v == ALL, which edge does each row of the tensor correspond to?

[BUG] Multi-Head Attention dimension error when calling built-in reduce function

When writing an attention module, a combination of fn.src_mul_edge and fn.sum is required: the first for applying the attention scores and the second for computing the weighted sum of the values (weighted by attention score).

There are three ways to implement this:

def src_mul_edge_1(edges):
    return {'sum': edges.src['h'] * edges.data['h']}

def sum_1(nodes):
    return {'h': nodes.mailbox['sum'].sum(1)}

g.update_all(message_func=fn.src_mul_edge('h', 'h', 'sum'), reduce_func=fn.sum('sum', 'h')) # 1
g.update_all(message_func=src_mul_edge_1, reduce_func=fn.sum('sum', 'h')) # 2
g.update_all(message_func=src_mul_edge_1, reduce_func=sum_1) # 3

They all work fine when the node features and edge features are both 1-D vectors.

However, when it comes to multi-head attention, the node features are 2-D tensors rather than 1-D vectors.

import dgl
import torch as th
import dgl.function as fn

g = dgl.DGLGraph()
g.add_nodes(10)
for i in range(10):
    g.add_edges(i, range(10))

g.ndata['h'] = th.rand(10, 10, 5) # (N, number_of_heads, dim)
g.edata['h'] = th.rand(100, 10, 1) # (M, number_of_heads, 1)

def src_mul_edge_1(edges):
    return {'sum': edges.src['h'] * edges.data['h']}

def sum_1(nodes):
    return {'h': nodes.mailbox['sum'].sum(1)}

#g.update_all(message_func=fn.src_mul_edge('h', 'h', 'sum'), reduce_func=fn.sum('sum', 'h'))
#g.update_all(message_func=src_mul_edge_1, reduce_func=fn.sum('sum', 'h'))
g.update_all(message_func=src_mul_edge_1, reduce_func=sum_1)

Then only the third statement works, which corresponds to using UDFs for both message_func and reduce_func. Calling either of the first two statements gives the error message:

  File "/home/zy1404/repos/dgl/python/dgl/graph.py", line 1324, in update_all
    Runtime.run(prog)
  File "/home/zy1404/repos/dgl/python/dgl/runtime/runtime.py", line 8, in run
    exe.run()
  File "/home/zy1404/repos/dgl/python/dgl/runtime/ir/executor.py", line 252, in run
    C = F.spmm(spA, B)
  File "/home/zy1404/repos/dgl/python/dgl/backend/pytorch/tensor.py", line 109, in spmm
    return th.spmm(x, y)
RuntimeError: addmm: matrices expected, got 3D tensor

I wonder if that is the expected behavior?

[Doc] add more documents

  • Define the signature of UDFs.
  • Documentation for NodeBatch and EdgeBatch; these classes are exposed to users.

[BUG] Loop detected in BatchedDGLGraph, but no loop found in each DGLGraph

I'm running the latest version of the TreeLSTM code, but I hit a bug when using the built-in topological traversal.

The problem is that applying topological_nodes_generator to each graph in the batch does not yield any error, but on the batched graph it reports that a loop was detected in the given graph.

More specifically, only 995 nodes are visited, but there are 999 nodes in the batched graph overall. I've checked the node and edge numbers of both the batched graph (999 nodes, 975 edges) and each graph in the batch; they are both fine.

I guess the problem is with BatchedDGLGraph; could those who worked on this module please help locate the bug? To reproduce it, you may use the code here and run python train.py --gpu 0.

@BarclayII @ylfdq1118 @jermainewang

[TODO] Remove anonymous fields?

In retrospect, I think the anonymous fields unnecessarily complicate the code base.
Allowing anonymous fields requires additional if/else logic to decide whether to return/update a dictionary or a tensor (as a result, we have two helper functions _get_repr() and _set_e_repr() solely for dealing with anonymous attributes), which are fundamentally different. Similar logic also appears in the executors (see scheduler.py). I suspect such code would crop up all over the place as we add more features.
Moreover, the semantics of get_n_repr(), set_n_repr(), etc., have subtleties if an anonymous field is present. Basically, we still allow anonymous fields and explicit fields to co-exist. Thus, if a user sets an anonymous field and then adds another explicit field, he/she can no longer get the anonymous field back by calling get_n_repr(). (This is related to #21.)
So my suggestion is to remove anonymous fields altogether and force users to always pass in dictionaries, which shouldn't be a heavy burden for users but would simplify our code base.

To mutable or not to mutable?

Here is a question we found when writing models with DGL. Let's first look at the following GCN example:

import torch.nn as nn
import torch.nn.functional as F
from dgl import DGLGraph

# Node update module
class NodeModule(nn.Module):
  def __init__(self):
    super(NodeModule, self).__init__()
    self.linear = nn.Linear(in_features=100, out_features=10)
  def forward(self, node, accum):
    return F.relu(self.linear(accum))

# GCN module
class GCN(nn.Module):
  def __init__(self):
    super(GCN, self).__init__()
    self.updmod = NodeModule()
  def forward(self, g):
    g.update_all(message_func='from_src',
                 reduce_func='sum',
                 update_func=self.updmod,
                 batchable=True)
    return g
# main loop
nx_graph, features, labels = load_data('cora')
gcn = GCN()
g = DGLGraph(nx_graph)
g.set_n_repr(features)
for epoch in range(MAX_EPOCH):
  gg = gcn(g)
  logits = gg.get_n_repr()
  loss = NLLLoss(logits, labels)
  loss.backward()
  # ...

Can you spot the bug in this code?

The bug is that we need to reset the input features to the graph at the beginning of each iteration:

# ...
g = DGLGraph(nx_graph)
for epoch in range(MAX_EPOCH):
  g.set_n_repr(features)  # Need to reset the features every iter!!
  gg = gcn(g)
# ...

This is a very subtle mistake, but we feel it is worth our attention. The reason for such a mistake is that we are very used to what autograd DNN frameworks provide us -- immutability. In PyTorch, all tensor operations are immutable (i.e., they always return new tensors). As a result, the following PyTorch code is totally fine:

X = ...  # init features
for epoch in range(MAX_EPOCH):
  y = MyModel(X)
  loss = MyLossFunc(y)
  loss.backward()

By contrast, because DGL is derived from networkx, our APIs are mutable. For example, update_all does not return a new graph; it in fact changes the node representations internally. Because of this, even if the user writes gg = gcn(g), gg and g point to the same DGLGraph, and the node reprs have already been updated by update_all.

This is an inherent conflict between networkx and PyTorch/TF/MXNet. I want to hear the opinions of model developers (@GaiYu0 @BarclayII @ivanbrugere @zzhang-cn). What do you think? Do you find this bug very subtle or actually a cunning pitfall? Do you like the current mutable graph, or would you prefer immutable objects?

Here, I also want to share more about what it would take to support an immutable graph object. What are the challenges and solutions? To make DGLGraph immutable, we need to handle two things:

  • Transformation on node/edge features.
  • Transformation on graph structures.

For transformations on node/edge features (e.g. sendto, recv, update_all, ...), we can return a new graph containing different data. In the example, the update_all function returns a new graph g2. Initially the graph (i.e., g1) has three node features a1, a2, a3. The node update function reads all node features but only updates the a1 attribute. As a result, the new g2 node storage (called a node frame) reuses the other two feature columns but uses the newly generated a1 column. Such a change to the system should have little overhead.

Transformations on graph structures are a little trickier. Users who are used to networkx actually expect mutable behavior:

import networkx as nx
g = nx.path_graph(3)
print(g.edges())  # (0, 1), (1, 2)
h = g
h.add_edge(0, 2)
print(g.edges())  # (0, 1), (0, 2), (1, 2)

We could make this immutable by using copy-on-write. The question is: shall we?
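
For the structural part, a minimal copy-on-write sketch (illustrative only, not a proposed DGL class):

class COWGraph:
    def __init__(self, edges=None):
        self._edges = edges if edges is not None else []
        self._shared = False

    def copy(self):
        g = COWGraph(self._edges)          # share the edge list
        g._shared = self._shared = True
        return g

    def add_edge(self, u, v):
        if self._shared:                   # first write after a copy
            self._edges = list(self._edges)
            self._shared = False
        self._edges.append((u, v))

    def edges(self):
        return list(self._edges)

With this, h = g.copy(); h.add_edge(0, 2) clones h's edge list on the first write, so g.edges() stays unchanged, unlike the networkx snippet above.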

[DISCUSSION] Subgraph semantics

As mentioned in our last meeting, there are a couple of things to think about when implementing subgraph.

  • Does a subgraph support graph mutation? Can such mutation be seen by its parent graph?
  • We need to support both shared and non-shared mode on node/edge frames.
  • What if a subgraph is created from another subgraph? Who is its parent?

@ylfdq1118 and I had a couple of discussions last week. Here we write down our thoughts.

For the python side subgraph class:

  • It needs to maintain a reference to its parent DGLGraph.
  • It needs to maintain node/edge mappings to its parent DGLGraph.

For graph index:

  • Currently, a subgraph is read-only with respect to graph structure, so mutation is not allowed.
  • A new graph index is created when a subgraph is created. There is no need to design a fancy view data structure like networkx's.

For node/edge frame:

  • Subgraphs will have a copy_from_parent API that takes arguments specifying which node/edge features need to be copied. The API is used to copy node/edge features from the parent graph. [Note: the name is tentative]
  • For "shared" mode, the copy_from_parent function will do nothing because the subgraph and parent graph already share the underlying frames.
  • For "non-shared" mode, the user needs to call copy_from_parent to explicitly copy node/edge features from parent graph.
    • If the user tries to get node/edge features before copy_from_parent, s/he will get nothing.
    • If the subgraph already has its own node/edge features, the copy_from_parent will override them.
    • Any update on the subgraph's node/edge features will not be seen by the parent graph. And thus, the memory consumption is of the order of the subgraph size.
    • To write the subgraph's node/edge features back to the parent graph, there are two options (see the sketch after this list):
      • The subgraph can have another write_to_parent API to write node/edge features back.
      • A functional API merge (currently here) to merge multiple subgraphs back into one parent. This should be more efficient than calling write_to_parent one by one.
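
A hypothetical usage sketch of the non-shared workflow (all names are tentative, as noted above):

sg = g.subgraph([0, 1, 2])            # structure only; frames start empty
sg.copy_from_parent()                 # pull node/edge features from g
sg.set_n_repr({'h': new_h})           # new_h: a tensor computed on the
                                      # subgraph; the update is invisible to g
sg.write_to_parent()                  # push features back explicitly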

For creating a subgraph from another subgraph:

  • Suppose we have a subgraph B created from another subgraph A, and subgraph A's parent is P.
  • If A is created in "shared" mode, then B can be viewed as created directly from P. (no need to maintain the tree hierarchy).
  • If A is created in "non-shared" mode, then B's parent is A. This means copy_from_parent and write_to_parent will only look at the node/edge frame of A.
    • To ease lookups across multiple levels of hierarchy, we could add a recursive flag to the two calls to indicate that such reads/writes are applied recursively through the subgraph hierarchy.

@zheng-da please have a look.

[RFC] C++ graph interface

The draft interface is here: https://github.com/jermainewang/dgl/blob/cpp/include/dgl/graph.h
It covers the requirements from networkx and from DGL's current cached graph.

Currently, we have several assumptions on the graph:

  • Vertices/edges are indexed by integers starting from zero.
  • Deletion of vertices/edges is not allowed.
  • Attributes are not stored in the graph. Instead, they are stored in Frames.

The idea is to lower some graph-manipulation logic to the C++ side. For example, it should speed up adding multiple edges at once (i.e., DGLGraph::AddEdges), finding neighbors (i.e., DGLGraph::InEdges or DGLGraph::OutEdges), and subgraph creation (i.e., DGLGraph::Subgraph).

The graph is represented by an adjacency list structure.
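
A rough Python model of the proposed structure (the real interface is C++; the method names loosely follow the draft header):

class GraphIndex:
    def __init__(self):
        self.adj_out = []    # adj_out[u] -> list of (v, eid)
        self.adj_in = []     # adj_in[v]  -> list of (u, eid)
        self.num_edges = 0

    def add_vertices(self, n):
        for _ in range(n):
            self.adj_out.append([])
            self.adj_in.append([])

    def add_edges(self, srcs, dsts):     # cf. DGLGraph::AddEdges
        for u, v in zip(srcs, dsts):
            eid = self.num_edges         # edge ids are dense integers
            self.adj_out[u].append((v, eid))
            self.adj_in[v].append((u, eid))
            self.num_edges += 1

    def in_edges(self, v):               # cf. DGLGraph::InEdges
        return self.adj_in[v]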

Please leave your comments.

Loopy belief propagation

The loopy belief propagation updates the message on a directed edge e = (u, v) by looking at the messages on all edges pointing to u, except the one that comes from v. An example from Junction-Tree VAE looks like this:

m_{uv} = f\big(x_u, \sum_{w \in N(u) \setminus \{v\}} m_{wu};\ W\big)

where the m's are messages, the x's are input features and the W's are weights (the latter two should be fine).
The readout would be something like

h_u = g\big(x_u, \sum_{v \in N(u)} m_{vu};\ W\big)

and the output vector is the average of all the h_u's.

I can't think of a direct way to do loopy belief propagation like this within the current framework.

A possible workaround for this particular phase is to divide the process into two stages, maintaining a field msg_sum containing the sum of the incoming messages for each node, s_u = \sum_{w \in N(u)} m_{wu}:

  • Send phase: compute m_uv as
    m_{uv} = f(x_u, s_u - m_{vu};\ W)
    As you can see, this requires us to query the message on the reverse edge, which I wonder is possible. AFAIK the message function does not allow you to look at edges other than the ones being updated.
  • Recv phase: compute s_u as the sum of all incoming m_uv.
  • [EDIT] Final readout phase: we can directly use s_u to compute the h's (a dense-tensor sketch of these phases follows).
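
A dense-tensor sketch of the two phases (illustrative only; f is a stand-in sum and the edge mask of the real graph is omitted):

import torch

num_nodes, dim = 4, 8
x = torch.randn(num_nodes, dim)
msg = torch.zeros(num_nodes, num_nodes, dim)   # msg[u, v] holds m_uv

for _ in range(3):                             # a few BP iterations
    s = msg.sum(dim=0)                         # recv: s[u] = sum_w m[w, u]
    # send: m_uv = f(x_u, s_u - m_vu); subtracting the reverse message
    # excludes the contribution coming back from v.
    msg = x[:, None, :] + s[:, None, :] - msg.transpose(0, 1)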

Any ideas?

About the semantic of propagation

I ran into a problem when implementing the propagate method. Consider a graph with edges 0->1, 0->2, 1->2. If we call list(nx.bfs_edges(g, 0)), it gives the edges [(0, 1), (0, 2)]; the edge (1, 2) is ignored. I wonder what the right semantics for BFS propagation in a GNN is. Which messages should be computed?

Partial update with changing tensor shapes

Consider following node update function:

import torch.nn as nn
import torch.nn.functional as F

# Node update module
class NodeModule(nn.Module):
  def __init__(self):
    super(NodeModule, self).__init__()
    self.linear = nn.Linear(in_features=100, out_features=10)
  def forward(self, node, accum):
    return F.relu(self.linear(accum))

The update module transforms node features of length 100 into new features of length 10. Suppose we apply this update function to a subset of the nodes; we then have the following issue:
In this example, only nodes 3, 4 and 5 are updated. As a result, their node features are transformed into vectors of length 10, and they can no longer be packed together with the other node reprs in one tensor.

Feature request: auto-batching preset tensor storages during construction from networkx

Just recording this issue raised in discussion.

Reference: https://github.com/BarclayII/icml18-jtnn/blob/master/jtnn/mpn.py#L87-L161

In particular, for initializing a DGLGraph where each node and edge gets its own feature vector, I have an "ideal" snippet which looks roughly like this: basically adding nodes and edges along with their feature vectors one by one to a networkx.Graph, then converting the result to a DGLGraph:

def mol2dgl_ideal(smiles_batch):
    '''
    Get a batched DGLGraph object from the given batch of SMILES
    '''
    graphs = []
    for smiles in smiles_batch:
        mol = get_mol(smiles)
        graph = Graph()
        for atom in mol.GetAtoms():
            graph.add_node(atom.GetIdx(), features=atom_features(atom))
        for bond in mol.GetBonds():
            graph.add_edge(
                    bond.GetBeginAtom().GetIdx(),
                    bond.GetEndAtom().GetIdx(),
                    features=bond_features(bond),
                    )
        graphs.append(graph)

    graphs = nx.disjoint_union_all(graphs)
    return DGLGraph(graphs)

The problem is that these feature vectors are not batched, so we cannot do batched computation in our current implementation. The workaround is to keep track of the node and edge features ourselves and batch-set them after construction:

def mol2dgl(smiles_batch):
    graphs = []
    atom_feature_list = []
    bond_feature_list = []
    bond_begin_nodes = []
    bond_end_nodes = []
    n_nodes = 0
    for smiles in smiles_batch:
        mol = get_mol(smiles)
        graph = Graph()
        for atom in mol.GetAtoms():
            graph.add_node(atom.GetIdx())
            atom_feature_list.append(atom_features(atom))
        for bond in mol.GetBonds():
            begin_idx = bond.GetBeginAtom().GetIdx()
            end_idx = bond.GetEndAtom().GetIdx()
            features = bond_features(bond)
            graph.add_edge(begin_idx, end_idx)
            bond_feature_list.append(features)
            bond_begin_nodes.append(begin_idx + n_nodes)
            bond_end_nodes.append(end_idx + n_nodes)
            # set up the reverse direction
            bond_feature_list.append(features)
            bond_begin_nodes.append(end_idx + n_nodes)
            bond_end_nodes.append(begin_idx + n_nodes)

        graphs.append(graph)

    graphs = nx.disjoint_union_all(graphs)
    dgl_graph = DGLGraph(graphs)
    dgl_graph.set_n_repr({'features': torch.stack(atom_feature_list)})
    dgl_graph.set_e_repr(
            {'features': torch.stack(bond_feature_list)},
            bond_begin_nodes,
            bond_end_nodes,
            )

    return dgl_graph

A mechanism to automatically batch-store the tensors during construction, or something similar, would be preferable (not among the highest priorities though).

[DISCUSSION] [RFC] Multi-edge support

Had a discussion with @ylfdq1118 today. Here is the conclusion we have reached:

Storage

In our current codebase, we store the graph as adjacency lists (node to node). We propose to change the storage in the C++ backend directly to incidence lists; more specifically, an incidence list for inbound edges and another for outbound ones. This implies that we no longer handle simple graphs and multigraphs separately as in NetworkX.
This will affect the implementation of the following backend methods:

  • pred and succ: Essentially, we have to find the inbound/outbound edges of the given vertices first, then look up the source/target vertex of those edges (we could also keep a predecessor/successor list in parallel to the inbound/outbound edges, but this essentially duplicates the edge list and I'm not sure we would like that). These methods should support a boolean argument unique to dictate whether we expect a unique list of vertices (useful for graph traversal) or not (useful for deciding source/target nodes for message and apply_edge functions). EDIT4: this should be fine.
  • vertex_subgraph: Similar to pred/succ, we need to do two-step queries. EDIT4: this should also be fine.
  • get_eid, in_edges, out_edges: Now it would return a list of edge IDs instead of one.
  • get_eids, batch_in_edges, batch_out_edges: We haven't gone through the semantics of these methods yet. I would suggest returning a list of lists if possible, following the same practice as igraph.Graph.neighborhood(). Returning a padded array (along with the lengths) could also do the job if we wish to enable padding (which is a separate functionality we presumably wish to consider). EDIT3: this should be fine since we are returning (source, target, edge ID) tuples.

EDIT4: after all, the changes to the C++ interface only involve changing the return type of get_eid to IdArray and of batch_get_eid to EdgeArray.

User Interface

Right now, the following user interfaces assume that no multi-edges exist:

  • set_e_repr
  • get_e_repr
  • apply_edges
  • send (and consequently send_and_recv, pull, push, and update_all)
    • Note that pull, push and update_all only make that assumption in their implementation, due to the usage of send; the interface itself is fine.
  • update_edge

Since the interfaces expect to take in and return tensors, we think that having multi-edges will complicate the semantics significantly, for get_e_repr and set_e_repr in particular: the users often have to keep track of how many edges exist between two given vertices (e.g. given a multigraph with edges ((0, 1), (0, 1), (1, 2)), set_e_repr([0, 1], [1, 2]) will actually require a tensor with 3 rows, which is quite unnatural).

EDIT2: Our suggestion is to

  1. For get_e_repr and set_e_repr, simply raise an error if a multi-edge exists between any of the given source/target pairs. We can add nicer semantics later.
  2. For the other methods, simply send/apply on all edges in between.
  3. Add counterparts working on edge IDs for apply_edges, send, send_and_recv, and update_edge (e.g. sendon), just as set_e_repr and get_e_repr already have edge ID counterparts, to allow sending/applying on individual edges.

We concluded that switching to incidence matrices in the backend should not break compatibility with the current user interfaces.

Side note: Filtering vertices/edges

Since RGCN already involves edges of different "types", represented by integers, we can expect models later on that need to perform graph operations on one "type" of node/edge at a time. To be even more general, users may want to operate on a subset of nodes/edges satisfying a certain predicate. We think it would be nice to have a filtering function which returns the subset of the given nodes/edges satisfying a given predicate. For example:

# returns all edges whose field t equals 2
eids = g.filter_edges(ALL, lambda e: e['t'] == 2)
# returns the nodes within @nids that have a maximum component greater than 1
# equivalent to something like nids[g.get_n_repr()['h'][nids].max(1) > 1]
nids = g.filter_nodes(nids, lambda n: n['h'].max(1) > 1)

Of course, the signature does not have to be exactly like this.

gSPMV Optimization

One thing we have noticed is that, if we use incidence matrices as our storage, then builtin reduce functions can always be expressed as a gSPMV between the inbound incidence matrix and the message tensors.

However, if the message function is also a builtin, we don't want to actually broadcast the messages onto edges, since that would potentially take up much more space. So:

  • If both message and reduce functions are builtins, we still stick to gSPMV on adjacency matrix if possible.
  • Otherwise, if the reduce function is a builtin, we can do gSPMV on inbound incidence matrix.

EDIT: gSPMV on the adjacency matrix with multi-edges is more subtle. Since multi-edges are analogous to duplicate entries in a COO sparse matrix, the actual implementation can differ depending on the form of the message and reduce functions:

  • For the message functions copy_src/src_mul_edge and the reduce function sum, on PyTorch we can still stick with the existing SPMM multiplications by directly constructing the (weighted) adjacency matrix from the edge lists; PyTorch automatically coalesces duplicate entries by summing them up. I don't know whether this is the case for MXNet and TensorFlow.
  • For other cases, it is very likely that we need to write our own operators to coalesce the duplicate entries. For instance, if the reduce function is max, the coalescing operation is no longer a sum but something like max pooling, so direct sparse tensor construction in PyTorch no longer works (a sketch of the sum case follows).
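
For the copy_src/sum case, a minimal PyTorch sketch of the coalescing behavior described above (the graph and sizes are made up for illustration):

import torch

# Two parallel edges 0->1 plus one edge 1->2, stored as (dst, src) COO
# indices so that adj @ feats sums incoming messages per destination.
indices = torch.tensor([[1, 1, 2],    # destinations
                        [0, 0, 1]])   # sources
adj = torch.sparse_coo_tensor(indices, torch.ones(3), size=(3, 3))

feats = torch.randn(3, 4)
# coalesce() merges the duplicate (1, 0) entries by summing their
# weights, so multi-edges get a sum-reduce "for free".
out = torch.sparse.mm(adj.coalesce(), feats)
# out[1] == 2 * feats[0]; a max-reduce would instead need a custom
# coalescing operator, as noted above.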

RGCN example

Under the proposed changes, we expect that the RGCN code can be greatly simplified and that our behavior can match the official implementation. Namely, the message function is simply a batched matrix product between the source tensor and the weight matrices corresponding to the edge types, while the reduce function is just a sum (@ylfdq1118 correct me if I'm wrong):

def msg(src, edges, W):
    # W is a 3D tensor (n_types, in_features, out_features) inside a
    # closure or a parameter in a PyTorch module.
    # (B, 1, in) @ (B, in, out) -> (B, 1, out), squeezed back to (B, out).
    return (src[:, None] @ W[edges['type']]).squeeze(1)

reducer = builtin.sum

Comments and questions are welcome.

[DISCUSSION] Pros and cons of the new signature of MessageFunc

The current interface of MessageFunc is

def MessageFunc(src, edge)

while its previous interface was

def MessageFunc(src, dst, edge)

The main advantage of the new interface is

  • many algorithms don't need the embeddings of the destination vertices to generate messages. If we always pass the destination embeddings, we need to fetch them from the Frame, which is unnecessary overhead.

The disadvantage of the new interface is

  • expressiveness. Some algorithms (e.g., GAT, GeniePath, Interaction Networks) require both the source and destination vertices to compute messages. With the new interface, the options for implementing these algorithms are to move all the logic into the reduce function or to use the edge update function.

The problem with moving all the logic into the reduce function is that we actually prefer to keep the reduce function simple, as discussed last time. For example, if the reduce function is simple, we can easily convert it into a per-edge reduce function internally (this can be achieved in the Gluon symbolic mode) to achieve better performance and lower memory consumption.

The problem with using the edge update function is memory complexity. We would need to generate edge data explicitly, which results in a memory complexity of O(|E|), while the algorithms themselves can achieve O(|V|). Reducing memory complexity is very important for both performance and scalability.

Another issue is that the new interface of MessageFunc and that of EdgeUpdateFunc are now inconsistent for an unobvious reason. This might confuse users.

[DISCUSSION] Overhead generated by TVM

I uploaded the JTNN profiling result here. It's a JSON file generated by pyinstrument, and it can be visualized using the Python Flame Chart tool.

I found that, out of 401s of forward propagation (including graph construction), 76s are spent on the _make_tvm_args call alone, and another 10.6s on the todgltensor call, which converts an index to a TVMArray. It seems the TVM backend is causing non-trivial overhead in my case.

Probably it's time to switch to Cython and see how it goes?

[DISCUSSION] GraphIndex profiling result

Here is a profiling result on my machine (Intel(R) Xeon(R) CPU E5-1620 v2 @ 3.70GHz). The code is at https://github.com/jermainewang/dgl-workspace/tree/master/profile-graph-index . It measures the time of each API call with different graph sizes. All times are in seconds.

Some highlights:

  • Adding 1M nodes takes 0.1s.
  • Adding 1M nodes and 5M edges takes 2.63s. Clearing them takes ~1s.
  • Querying the edge ids of 5M edges takes 0.53s.
  • Querying in edges of 1M nodes (5M edges) takes 0.1s.
  • Getting all the edges of a 1M-node graph takes 50ms. (The cost is simply the memory copy from the internal vector to an IdArray.)
  • Node subgraph:
    • Creating a 10K size subgraph out of a 1M graph (E=5M) takes 0.0147s.
    • Creating a 100K size subgraph out of a 1M graph (E=5M) takes 0.4841s.
  • Creating the adjacency matrix in pytorch of a 100K graph takes 0.0063s. Creating the incidence matrix in pytorch of a 100K graph takes 0.0127s.
  • Batching 100 graphs each of 50 size takes 0.0051s.

@zheng-da @BarclayII @ylfdq1118 @GaiYu0 Please check which of these are potential bottlenecks for the current models.

Profiling add nodes: 100
Profiling add nodes: 1000
Profiling add nodes: 10000
Profiling add nodes: 100000
Profiling add nodes: 1000000

=== add nodes ===
+----------+----------+
| 1.0E+02  | 0.0000   |
+----------+----------+
| 1.0E+03  | 0.0001   |
+----------+----------+
| 1.0E+04  | 0.0009   |
+----------+----------+
| 1.0E+05  | 0.0106   |
+----------+----------+
| 1.0E+06  | 0.1023   |
+----------+----------+

Profiling clear nodes: 100
Profiling clear nodes: 1000
Profiling clear nodes: 10000
Profiling clear nodes: 100000
Profiling clear nodes: 1000000

=== clear nodes ===
+----------+----------+
| 1.0E+02  | 0.0000   |
+----------+----------+
| 1.0E+03  | 0.0000   |
+----------+----------+
| 1.0E+04  | 0.0000   |
+----------+----------+
| 1.0E+05  | 0.0005   |
+----------+----------+
| 1.0E+06  | 0.0059   |
+----------+----------+

Profiling add edges: 500
Profiling add edges: 5000
Profiling add edges: 50000
Profiling add edges: 500000
Profiling add edges: 5000000

=== add edges ===
+----------+----------+
| 5.0E+02  | 0.0001   |
+----------+----------+
| 5.0E+03  | 0.0007   |
+----------+----------+
| 5.0E+04  | 0.0084   |
+----------+----------+
| 5.0E+05  | 0.2007   |
+----------+----------+
| 5.0E+06  | 2.6291   |
+----------+----------+

Profiling clear edges: 500
Profiling clear edges: 5000
Profiling clear edges: 50000
Profiling clear edges: 500000
Profiling clear edges: 5000000

=== clear edges ===
+----------+----------+
| 5.0E+02  | 0.0000   |
+----------+----------+
| 5.0E+03  | 0.0002   |
+----------+----------+
| 5.0E+04  | 0.0014   |
+----------+----------+
| 5.0E+05  | 0.0613   |
+----------+----------+
| 5.0E+06  | 0.9252   |
+----------+----------+

Profiling has nodes: 5
Profiling has nodes: 50
Profiling has nodes: 500
Profiling has nodes: 5000
Profiling has nodes: 50000

=== has nodes (5%) ===
+----------+----------+
| 5.0E+00  | 0.0000   |
+----------+----------+
| 5.0E+01  | 0.0000   |
+----------+----------+
| 5.0E+02  | 0.0000   |
+----------+----------+
| 5.0E+03  | 0.0000   |
+----------+----------+
| 5.0E+04  | 0.0001   |
+----------+----------+

Profiling has edges: 25
Profiling has edges: 250
Profiling has edges: 2500
Profiling has edges: 25000
Profiling has edges: 250000

=== has edges (5%) ===
+----------+----------+
| 2.5E+01  | 0.0000   |
+----------+----------+
| 2.5E+02  | 0.0000   |
+----------+----------+
| 2.5E+03  | 0.0001   |
+----------+----------+
| 2.5E+04  | 0.0008   |
+----------+----------+
| 2.5E+05  | 0.0165   |
+----------+----------+

Profiling edge ids: 500
Profiling edge ids: 5000
Profiling edge ids: 50000
Profiling edge ids: 500000
Profiling edge ids: 5000000

=== edge ids ===
+----------+----------+
| 5.0E+02  | 0.0001   |
+----------+----------+
| 5.0E+03  | 0.0002   |
+----------+----------+
| 5.0E+04  | 0.0015   |
+----------+----------+
| 5.0E+05  | 0.0392   |
+----------+----------+
| 5.0E+06  | 0.5345   |
+----------+----------+

Profiling in edges: 100
Profiling in edges: 1000
Profiling in edges: 10000
Profiling in edges: 100000
Profiling in edges: 1000000

=== in edges ===
+----------+----------+
| 1.0E+02  | 0.0001   |
+----------+----------+
| 1.0E+03  | 0.0001   |
+----------+----------+
| 1.0E+04  | 0.0003   |
+----------+----------+
| 1.0E+05  | 0.0061   |
+----------+----------+
| 1.0E+06  | 0.1020   |
+----------+----------+

Profiling out edges: 100
Profiling out edges: 1000
Profiling out edges: 10000
Profiling out edges: 100000
Profiling out edges: 1000000

=== out edges ===
+----------+----------+
| 1.0E+02  | 0.0001   |
+----------+----------+
| 1.0E+03  | 0.0001   |
+----------+----------+
| 1.0E+04  | 0.0003   |
+----------+----------+
| 1.0E+05  | 0.0061   |
+----------+----------+
| 1.0E+06  | 0.0983   |
+----------+----------+

Profiling edges: 100
Profiling edges: 1000
Profiling edges: 10000
Profiling edges: 100000
Profiling edges: 1000000

=== edges ===
+----------+----------+
| 1.0E+02  | 0.0001   |
+----------+----------+
| 1.0E+03  | 0.0001   |
+----------+----------+
| 1.0E+04  | 0.0002   |
+----------+----------+
| 1.0E+05  | 0.0040   |
+----------+----------+
| 1.0E+06  | 0.0534   |
+----------+----------+

Profiling edges: 100
Profiling edges: 1000
Profiling edges: 10000
Profiling edges: 100000
Profiling edges: 1000000

=== edges sorted ===
+----------+----------+
| 1.0E+02  | 0.0001   |
+----------+----------+
| 1.0E+03  | 0.0004   |
+----------+----------+
| 1.0E+04  | 0.0037   |
+----------+----------+
| 1.0E+05  | 0.0432   |
+----------+----------+
| 1.0E+06  | 0.5492   |
+----------+----------+

Profiling in degrees: 100
Profiling in degrees: 1000
Profiling in degrees: 10000
Profiling in degrees: 100000
Profiling in degrees: 1000000

=== in degrees ===
+----------+----------+
| 1.0E+02  | 0.0000   |
+----------+----------+
| 1.0E+03  | 0.0000   |
+----------+----------+
| 1.0E+04  | 0.0001   |
+----------+----------+
| 1.0E+05  | 0.0006   |
+----------+----------+
| 1.0E+06  | 0.0058   |
+----------+----------+

Profiling out degrees: 100
Profiling out degrees: 1000
Profiling out degrees: 10000
Profiling out degrees: 100000
Profiling out degrees: 1000000

=== out degrees ===
+----------+----------+
| 1.0E+02  | 0.0000   |
+----------+----------+
| 1.0E+03  | 0.0000   |
+----------+----------+
| 1.0E+04  | 0.0001   |
+----------+----------+
| 1.0E+05  | 0.0006   |
+----------+----------+
| 1.0E+06  | 0.0043   |
+----------+----------+

Profiling node subgraph: 1
Profiling node subgraph: 10
Profiling node subgraph: 100
Profiling node subgraph: 996
Profiling node subgraph: 9947

=== node subgraph (1%) ===
+----------+----------+
| 100      | 0.0001   |
+----------+----------+
| 1000     | 0.0001   |
+----------+----------+
| 10000    | 0.0002   |
+----------+----------+
| 100000   | 0.0008   |
+----------+----------+
| 1000000  | 0.0147   |
+----------+----------+

Profiling node subgraph: 5
Profiling node subgraph: 48
Profiling node subgraph: 488
Profiling node subgraph: 4898
Profiling node subgraph: 48790

=== node subgraph (5%) ===
+----------+----------+
| 100      | 0.0001   |
+----------+----------+
| 1000     | 0.0001   |
+----------+----------+
| 10000    | 0.0005   |
+----------+----------+
| 100000   | 0.0071   |
+----------+----------+
| 1000000  | 0.1671   |
+----------+----------+

Profiling node subgraph: 10
Profiling node subgraph: 93
Profiling node subgraph: 951
Profiling node subgraph: 9510
Profiling node subgraph: 95170

=== node subgraph (10%) ===
+----------+----------+
| 100      | 0.0001   |
+----------+----------+
| 1000     | 0.0002   |
+----------+----------+
| 10000    | 0.0020   |
+----------+----------+
| 100000   | 0.0231   |
+----------+----------+
| 1000000  | 0.4841   |
+----------+----------+

Profiling node subgraph: 18
Profiling node subgraph: 178
Profiling node subgraph: 1842
Profiling node subgraph: 18193
Profiling node subgraph: 181262

=== node subgraph (20%) ===
+----------+----------+
| 100      | 0.0001   |
+----------+----------+
| 1000     | 0.0004   |
+----------+----------+
| 10000    | 0.0041   |
+----------+----------+
| 100000   | 0.0634   |
+----------+----------+
| 1000000  | 1.1784   |
+----------+----------+

Profiling adjacency matrix: 100
Profiling adjacency matrix: 1000
Profiling adjacency matrix: 10000
Profiling adjacency matrix: 100000
Profiling adjacency matrix: 1000000

=== adjmat ===
+----------+----------+
| 100      | 0.0002   |
+----------+----------+
| 1000     | 0.0002   |
+----------+----------+
| 10000    | 0.0005   |
+----------+----------+
| 100000   | 0.0063   |
+----------+----------+
| 1000000  | 0.1354   |
+----------+----------+

Profiling incidence matrix: 100
Profiling incidence matrix: 1000
Profiling incidence matrix: 10000
Profiling incidence matrix: 100000
Profiling incidence matrix: 1000000

=== incmat ===
+----------+----------+
| 100      | 0.0003   |
+----------+----------+
| 1000     | 0.0003   |
+----------+----------+
| 10000    | 0.0010   |
+----------+----------+
| 100000   | 0.0127   |
+----------+----------+
| 1000000  | 0.3241   |
+----------+----------+

Profiling line graph: 100
Profiling line graph: 1000
Profiling line graph: 10000
Profiling line graph: 100000
Profiling line graph: 1000000

=== line graph (density 1/n) ===
+----------+----------+
| 1.0E+02  | 0.0001   |
+----------+----------+
| 1.0E+03  | 0.0010   |
+----------+----------+
| 1.0E+04  | 0.0074   |
+----------+----------+
| 1.0E+05  | 0.1506   |
+----------+----------+
| 1.0E+06  | 2.8589   |
+----------+----------+

Profiling union 5 graph(n=50):
Profiling union 10 graph(n=50):
Profiling union 50 graph(n=50):
Profiling union 100 graph(n=50):
Profiling union 500 graph(n=50):
Profiling union 5 graph(n=100):
Profiling union 10 graph(n=100):
Profiling union 50 graph(n=100):
Profiling union 100 graph(n=100):
Profiling union 500 graph(n=100):
Profiling union 5 graph(n=500):
Profiling union 10 graph(n=500):
Profiling union 50 graph(n=500):
Profiling union 100 graph(n=500):
Profiling union 500 graph(n=500):
Profiling union 5 graph(n=1000):
Profiling union 10 graph(n=1000):
Profiling union 50 graph(n=1000):
Profiling union 100 graph(n=1000):
Profiling union 500 graph(n=1000):

=== disjoint union graph ===
+----------+----------+
| 50x5     | 0.0003   |
+----------+----------+
| 50x10    | 0.0006   |
+----------+----------+
| 50x50    | 0.0025   |
+----------+----------+
| 50x100   | 0.0051   |
+----------+----------+
| 50x500   | 0.0300   |
+----------+----------+
| 100x5    | 0.0007   |
+----------+----------+
| 100x10   | 0.0011   |
+----------+----------+
| 100x50   | 0.0055   |
+----------+----------+
| 100x100  | 0.0116   |
+----------+----------+
| 100x500  | 0.0604   |
+----------+----------+
| 500x5    | 0.0031   |
+----------+----------+
| 500x10   | 0.0059   |
+----------+----------+
| 500x50   | 0.0351   |
+----------+----------+
| 500x100  | 0.0744   |
+----------+----------+
| 500x500  | 0.4055   |
+----------+----------+
| 1000x5   | 0.0064   |
+----------+----------+
| 1000x10  | 0.0130   |
+----------+----------+
| 1000x50  | 0.0839   |
+----------+----------+
| 1000x100 | 0.1733   |
+----------+----------+
| 1000x500 | 0.9125   |
+----------+----------+

Profiling partition 5 graph(n=50):
Profiling partition 10 graph(n=50):
Profiling partition 50 graph(n=50):
Profiling partition 100 graph(n=50):
Profiling partition 500 graph(n=50):
Profiling partition 5 graph(n=100):
Profiling partition 10 graph(n=100):
Profiling partition 50 graph(n=100):
Profiling partition 100 graph(n=100):
Profiling partition 500 graph(n=100):
Profiling partition 5 graph(n=500):
Profiling partition 10 graph(n=500):
Profiling partition 50 graph(n=500):
Profiling partition 100 graph(n=500):
Profiling partition 500 graph(n=500):
Profiling partition 5 graph(n=1000):
Profiling partition 10 graph(n=1000):
Profiling partition 50 graph(n=1000):
Profiling partition 100 graph(n=1000):
Profiling partition 500 graph(n=1000):

=== disjoint partition graph ===
+----------+----------+
| 50x5     | 0.0003   |
+----------+----------+
| 50x10    | 0.0006   |
+----------+----------+
| 50x50    | 0.0022   |
+----------+----------+
| 50x100   | 0.0044   |
+----------+----------+
| 50x500   | 0.0256   |
+----------+----------+
| 100x5    | 0.0005   |
+----------+----------+
| 100x10   | 0.0009   |
+----------+----------+
| 100x50   | 0.0038   |
+----------+----------+
| 100x100  | 0.0075   |
+----------+----------+
| 100x500  | 0.0423   |
+----------+----------+
| 500x5    | 0.0019   |
+----------+----------+
| 500x10   | 0.0033   |
+----------+----------+
| 500x50   | 0.0194   |
+----------+----------+
| 500x100  | 0.0368   |
+----------+----------+
| 500x500  | 0.2000   |
+----------+----------+
| 1000x5   | 0.0036   |
+----------+----------+
| 1000x10  | 0.0071   |
+----------+----------+
| 1000x50  | 0.0347   |
+----------+----------+
| 1000x100 | 0.0645   |
+----------+----------+
| 1000x500 | 0.3294   |
+----------+----------+

Some API corner cases

Hi all,

I want to use this issue as a reminder of the corner cases of the API semantics.
(1) Is the message phase a prerequisite of the node update phase? Our current implementation checks whether the predecessors of a node have sent messages and raises an error if no message is found. This means there must be some sendto call before recvfrom. One argument for this is that if the node update does not require msgs, it can be computed outside of DGLGraph by reading out the node values and performing the transformation. I want to know your opinions on this: do you think we should allow node updates without messages?
(2) Should messages be transient tensors? Our current implementation saves messages on edges and never deletes them, which increases total memory consumption. Also, what is the right semantics if recvfrom is called repeatedly?
(3) What is the msg if the node has no predecessors?
The following is the code example:

def mfunc(u, v, e_uv):
  return u['h']
def ufunc(u, msg):
  return u['h'] + msg
g = DGLGraph()
g.add_edge(0, 1)  # The graph has one edge 0->1
g.register_msg_func(mfunc)
g.register_update_func(ufunc)
g.register_reduce_func('sum')

# Case 1: what will happen?
g.recvfrom(1, [0])

# Case 2: what will happen?
g.sendto(0, 1)
g.recvfrom(1, [0])
g.recvfrom(1, [0])

# Case 3: what will happen?
g.recvfrom(0)

How to support graph batch-norm?

With the current API, we can implement batch-norm by adding a shadow node and pulling/pushing globally. However, this may not be very user-friendly. An ideal syntax would be:

batch_norm(node_reprs['x'])
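
Since the node features of a (batched) graph already live in a single (N, D) tensor, one way to realize that syntax is plain BatchNorm1d over the node frame (a sketch, not a proposed API; the feature dimension 16 is made up):

import torch.nn as nn

bn = nn.BatchNorm1d(num_features=16)   # 16 = assumed feature dimension
# normalize 'x' across all nodes of the (batched) graph at once
g.set_n_repr({'x': bn(g.get_n_repr()['x'])})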

[RFC] DGLGraph APIs Related to Dynamic Graphs

Here are two small issues with the DGLGraph APIs concerning whether the model/tutorial code can be made cleaner:

  1. set_n_initializer and set_e_initializer

Most of the time, we simply initialize features as zero tensors. Therefore it would be more convenient if we could simply call g.set_n_initializer() with:

def set_n_initializer(self, initializer=None):
    """Set the initializer for empty node features.
    Initializer is a callable that returns a tensor given the shape, data type
    and device context.
    Parameters
    ----------
    initializer : callable
        The initializer.
    """
    if initializer is None:
        initializer = lambda shape, dtype, ctx: F.zeros(shape, device=ctx)
    self._node_frame.set_initializer(initializer)

Similarly for set_e_initializer.

  2. Set features
    When generating a graph from scratch, the first time we add a feature after adding the first node we must call g.ndata[key] = value. Once the column has at least one row, the next time we add another node and its feature we can call g.nodes[v].data[key] = value. This is due to L281 of frame.py. Is it possible to allow g.nodes[v].data[key] = value even when this is the call that creates the column, as long as v == range(g.number_of_nodes())? Similarly for edges.
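
A hypothetical usage sketch tying both points together, assuming the default-argument set_n_initializer proposed in point 1:

import dgl
import torch as th

g = dgl.DGLGraph()
g.set_n_initializer()            # zero-initializer by default (proposed)
g.add_nodes(3)
g.ndata['h'] = th.ones(3, 4)     # first assignment creates the column
g.add_nodes(2)                   # new rows are filled by the initializer
print(g.ndata['h'][3:])          # -> zeros(2, 4)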

Cannot get node attribute when setting by set_n_repr

Minimal Reproduce Code:

import dgl
import torch as th

g = dgl.DGLGraph()

g.add_node(0)
g.add_node(1)

g.set_n_repr(th.ones(2, 1))
print(g.nodes[0])
#{}
g.set_n_repr({'h': th.ones(2, 1)})
print(g.nodes[0])
#{}

However, through the gcn_batch.py example, we found that when update_all is called with batchable=True, everything works fine. When setting batchable=False in the msg function, it also fails to get the representation.

[BUG] 'TVMContext' object has no attribute 'type'

When running tree_lstm, I encountered a bug with the new tensor.py.

Description:

Traceback (most recent call last):
  File "train.py", line 123, in <module>
    main(args)
  File "train.py", line 78, in main
    giter = list(tensor_topo_traverse(graph, False, args))
  File "train.py", line 21, in tensor_topo_traverse
    adjmat = g._graph.adjacency_matrix().get(nd.cpu())
  File "/home/zy1404/repos/dgl/python/dgl/utils.py", line 284, in get
    self._ctx_dict[ctx] = self._generator(ctx)
  File "/home/zy1404/repos/dgl/python/dgl/graph_index.py", line 504, in <lambda>
    self._cache['adj'] = utils.CtxCachedObject(lambda ctx: F.copy_to(mat, ctx))
  File "/home/zy1404/repos/dgl/python/dgl/backend/pytorch/tensor.py", line 50, in copy_to
    if ctx.type == 'cpu':
AttributeError: 'TVMContext' object has no attribute 'type'

It seems that in runtime_ctypes.py, TVMContext does not have a type field, and the device_name method is not supported since _api_internal.py is empty.

[DISCUSSION] Potential confusion for accessing edges

In the docs for dgl.DGLGraph, it says that we can access a single edge with either g.edges[src, dst] or g.edges[edge_id]. But this can be a bit confusing, as one might wonder how to access two edges by edge ids. Although careful users will realize that we only allow accessing multiple edges by edge ids through a list of edge ids, this is still confusing. Perhaps we should forbid using a single int edge id to access the associated edge and require users to consistently use a list of ints for edge ids?

`add_edges_from` in batch.py does not preserve edge order

In batch.py:28, it seems that the new batched graph adds its edges using add_edges_from, which does not preserve the order of the supplied edge list.

This will essentially fail the following test:

import dgl
import torch

g1 = dgl.DGLGraph()
g1.add_nodes_from([0, 1, 2, 3, 4, 5])
g1.add_edges_from([(4, 5), (4, 3), (2, 3), (2, 1), (0, 1)])
g1.edge_list
e1 = torch.randn(5, 10)
g1.set_e_repr(e1)
g2 = dgl.DGLGraph()
g2.add_nodes_from([0, 1, 2, 3, 4, 5])
g2.add_edges_from([(0, 1), (1, 2), (2, 3), (5, 4), (4, 3), (5, 0)])
e2 = torch.randn(6, 10)
g2.set_e_repr(e2)
g = dgl.batch([g1, g2])
r1 = g.get_e_repr()[g.get_edge_id(4, 5)]
r2 = g1.get_e_repr()[g1.get_edge_id(4, 5)]
print(r1)
print(r2)
assert torch.equal(r1, r2)
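
One possible direction for a fix (a sketch, not the actual batch.py code) is to append the edges explicitly in the supplied order instead of relying on add_edges_from:

# Hypothetical replacement for the add_edges_from call in batch.py:
# adding edges one by one keeps edge ids aligned with the input order.
for u, v in edge_list:
    batched_graph.add_edge(u, v)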

API rename

Currently, our API names are confusing (e.g. update_edge vs. update_by_edge). I propose the following changes:

  • sendto -> send to match recv
  • update_by_edge -> send_and_recv
  • update_to -> pull
  • update_from -> push

Two new APIs to easily apply changes to the node/edges:

  • apply_nodes: Equivalent to recv API without messages.
  • apply_edges: Equivalent to update_edge API without src/dst node features.

[BUG] Tree-LSTM performs terribly after using the new scheduler & executor

Tree-LSTM behaves differently before and after #140.

As a reference:
DGL version: commit 3e8b63e
The result of epoch 0:

Epoch 00000 training time 23.9037s
Epoch 00000 | Dev Acc 0.7920 | Root Acc 0.4278
Epoch 00000 | Test Acc 0.7897 | Root Acc 0.4380

DGL version: commit deb653f
The result of epoch 0:

Epoch 00000 training time 23.2815s
Epoch 00000 | Dev Acc 0.6856 | Root Acc 0.2552
Epoch 00000 | Test Acc 0.6859 | Root Acc 0.2271

In fact, for a 5-class classification task the baseline is supposed to be close to 1/5; the root accuracy above being near that baseline suggests the model is not being trained correctly.

I still have no idea what causes it.

TODOs for C++ GraphIndex

Implementations
GraphIndex related:

  • Graph operators: Currently, only DisjointUnion has been implemented.
    - [ ] Graph traversal: a topological traversal implementation in C++ that is used by the tree-lstm example. (Move to next PR for improving TreeLSTM.)

BatchedGraph related batch.py:

  • Basic implementation: batch, unbatch.
    - [ ] Need to finish other APIs: split, slice/update, readout. (Move to next PR for improving DGMG)

Subgraph related: (@zheng-da )
Discussion thread #67

  • For the graph index, currently only the node subgraph has been implemented on the C++ side.
    On the python side, we need to rewrite the DGLSubgraph class to support shared writes on node/edge features. (Move to next PR.)

Necessary optimizations

  • Zero-copy between dgl.ndarray and torch.tensor.
  • Fused implementation of send_and_recv.
  • Lowering degree bucketing to C++.
  • Benchmark the throughput of GraphIndex

Project managements

  • New install script that also compiles the cpp files.
  • New Jenkins script for CI.
  • I suggest we set up a dockerfile for our project to ease the above Jenkins setup. We could borrow from TVM again.
  • Set up lint checks for C++ files. Borrow from dmlc-core.

Others

  • Remove batchable=False mode and related tests/codes.
  • Remove anonymous representation.
  • Double check all the models. Benchmark their speed and accuracy under the new change. (TBD: LineGraph) (DGMG and JTNN will be included in the next PR)

The goal is to make sure that once the cpp branch is merged, all the tests and examples still work. Therefore, the BatchedGraph and Subgraph utilities can be incomplete; they only need to serve the current examples.

[BUG] update_edge API broken

import dgl
import torch as th
g = dgl.DGLGraph()
g.add_nodes(10)
g.add_edges([0, 1, 2, 3, 4], [5, 6, 7, 8, 9])
g.set_e_repr({'e': th.ones(5)})
g.update_edge(edge_func=lambda u, v, edge: edge['e'])

(screenshot of the resulting error traceback omitted)

Bug in _batch_reduce() with dicts

In graph.py:647

all_reduced_msgs = {key : F.pack(val) for key, val in reduced_msgs.items()}

reduced_msgs is always a list, so calling .items() on it throws an error when the reducer returns dicts.
Fix:

keys = reduced_msgs[0].keys()
all_reduced_msgs = {
    key: F.pack([msg[key] for msg in reduced_msgs])
    for key in keys}

Subgraph sampling

As proposed in this design, subgraph sampling has three steps:

  • Compute the sampling probability. There are many different ways of computing it; the output of this step is the sampling probability associated with each vertex, i.e. a 1D NDArray.
def compute_prob(g, method) => 1D NDArray
  • Sample vertices based on the probability. This step traverses local neighborhoods from the seed vertices and samples neighbors according to the sampling probability. It returns a 1D array containing the sampled vertices.
def sample_vertices(g, prob, seed_vertices, num_hops, traverse_method, max_samples) => 1D NDArray
  • Construct the node-induced subgraph from the sampled vertices.
def node_subgraph(g, vertices) => subgraph

Each step can be written as a module, which keeps the design and implementation of each step simple and gives users the freedom to mix and match probability computations and sampling strategies. Since each step is compute-intensive on its own, the modular boundary adds no noticeable performance overhead. A sketch of how the modules compose follows.
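
The sketch below uses plain numpy arrays in place of NDArray and assumes a graph object with number_of_nodes / in_degree / out_degree / predecessors accessors. The degree-proportional probability and the hop-by-hop expansion are illustrative choices, not the proposed implementation; traverse_method is omitted for brevity.

import numpy as np

def compute_prob(g, method='degree'):
    # Illustrative: only the 'degree' method is sketched, with probability
    # proportional to each vertex's total degree.
    deg = np.array([g.in_degree(v) + g.out_degree(v)
                    for v in range(g.number_of_nodes())], dtype=np.float64)
    return deg / deg.sum()

def sample_vertices(g, prob, seed_vertices, num_hops, max_samples):
    # Illustrative: expand the seed set hop by hop, drawing neighbors with
    # probability proportional to `prob`, up to `max_samples` vertices.
    rng = np.random.default_rng()
    sampled, frontier = set(seed_vertices), set(seed_vertices)
    for _ in range(num_hops):
        neighbors = {u for v in frontier for u in g.predecessors(v)} - sampled
        budget = max_samples - len(sampled)
        if not neighbors or budget <= 0:
            break
        cand = np.fromiter(neighbors, dtype=np.int64)
        p = prob[cand] / prob[cand].sum()
        picked = rng.choice(cand, size=min(len(cand), budget), replace=False, p=p)
        frontier = set(picked.tolist())
        sampled |= frontier
    return np.fromiter(sampled, dtype=np.int64)

# Step 3 would then be: sg = node_subgraph(g, sampled_vertices)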

Anonymous representations

Currently, if an update function returns an object other than a dict, the object is supposed to be the new (anonymous) node/edge representation (named __N_REPR__ or __E_REPR__ implicitly). However, a node/edge with an anonymous representation may have explicitly named representations as well (those returned in a dict by an update function).

When we pass node/edge representations to user-defined functions, how do we tell whether the user wants the anonymous representation or the explicitly named ones? Maybe we need to enforce the rule that once a node/edge has an anonymous representation, it cannot have any explicitly named representations, and vice versa.
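
A small example of the ambiguity, with made-up function and key names:

def update_anon(node, accum):
    # Returns a bare tensor: stored as the anonymous repr (__N_REPR__).
    return node + accum

def update_named(node, accum):
    # Returns a dict: stored under the explicit name 'h'.
    return {'h': accum}

# If both functions have run on the same node, the node carries an anonymous
# repr and a named repr 'h' at the same time. When the node is next passed to
# a user-defined function, should it receive the bare tensor, the dict, or a
# merge of the two? Under the rule proposed above, the second call would
# instead raise an error, since the node already has an anonymous repr.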

[BUG] Could not set attributes in BatchedDGLGraph

After updating to the latest version, the following code no longer works:

import dgl
import torch as th
a = dgl.DGLGraph()
a.add_nodes(4)
b = dgl.DGLGraph()
b.add_nodes(3)
c = dgl.batch([a, b])
c.ndata['h'] = th.ones(7, 1)

Error info:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/zy1404/repos/dgl/python/dgl/view.py", line 61, in __setitem__
    self._graph.set_n_repr({key : val}, self._nodes)
  File "/home/zy1404/repos/dgl/python/dgl/graph.py", line 821, in set_n_repr
    self._node_frame[key] = val
  File "/home/zy1404/repos/dgl/python/dgl/frame.py", line 618, in __setitem__
    self.update_column(key, val, inplace=False)
  File "/home/zy1404/repos/dgl/python/dgl/frame.py", line 647, in update_column
    self._frame[name] = col
  File "/home/zy1404/repos/dgl/python/dgl/frame.py", line 291, in __setitem__
    self.update_column(name, data)
  File "/home/zy1404/repos/dgl/python/dgl/frame.py", line 364, in update_column
    (self.num_rows, len(col)))
dgl._ffi.base.DGLError: Expected data to have 0 rows, got 7.

Forbid multiple edges when multigraph is False in DGLGraph

If I understand correctly, a dgl.DGLGraph object is not a multigraph by default. However, when multiple edges with the same endpoints and direction are added, no warning or error is raised, and the duplicate edges are created as if the graph were a multigraph. A minimal repro sketch is below.

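The sketch assumes the multigraph keyword of the current DGLGraph constructor (default False):

import dgl

g = dgl.DGLGraph(multigraph=False)  # False is also the default
g.add_nodes(2)
g.add_edge(0, 1)
g.add_edge(0, 1)                    # same endpoints and direction

# Expected: a warning or error, since the graph is not a multigraph.
# Observed: the duplicate edge is silently created.
print(g.number_of_edges())          # prints 2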

Dynamically maintaining line graph?

As discussed in the meeting earlier, Junction Tree VAE requires us to perform partial loopy-BP-like message passing (in particular, edge messages are computed in DFS order) while generating nodes when decoding a tree during inference.

It seems that we somehow need to maintain a dynamic non-backtracking line graph, handling (appended) new edges on-the-fly, in order to use the DGL reduce primitive there.

Filing this issue just for bookkeeping.

Problem of Memory Leak

In my current capsule implementation, I have to manually call del self.g in my code; otherwise it raises a CUDA out-of-memory error. Not sure what the problem is.
Code link: here
Run by python main.py, link

[DISCUSSION] Decouple networkx and deprecate batchable=False mode?

With the upcoming new C++ graph library support, it is time to think about decoupling DGLGraph from networkx and whether we want to continue supporting the batchable=False mode.

Here is how DGLGraph will look when it is backed by C++. You can see several changes:
(1) DGLGraph no longer inherits from networkx's DiGraph. Instead, we have our own GraphIndex object that directly calls C APIs.
(2) The weird CachedGraph object is removed. It is replaced by our own graph library, so python-igraph is no longer a dependency.
(3) There is only one kind of node/edge attribute storage: dgl.Frame. Data cannot be stored in networkx's dict-of-dict-of-dict structure any more. As a result, the batchable=False mode cannot be supported.
(4) A networkx graph can be converted from/to a DGLGraph very easily. We also plan to support automatic conversion of networkx's node/edge attributes from/to our frame storage (issue #41); a conversion sketch follows.
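
The sketch assumes the conversion entry points keep their current from_networkx / to_networkx names:

import networkx as nx
import dgl

nxg = nx.erdos_renyi_graph(10, 0.3, directed=True)

g = dgl.DGLGraph()
g.from_networkx(nxg)    # networkx -> DGLGraph; attribute conversion is what issue #41 tracks

nxg2 = g.to_networkx()  # DGLGraph -> networkx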

I think the non-batchable mode has served its purpose as a naive baseline in the early days; now it only adds confusion for new users.

What do you guys think?

Running TODOs

Hi all,

Let's use this thread to book-keep all the major TODOs before our first major milestone.

High priority
Tasks that have a major contributor in charge.

  • C <=> Python interface (probably borrow from TVM).
  • Lowering graph related logic to C++
    • Light weight & efficient C++ graph data structure
    • Graph construction and graph batching
    • Graph query logic (the current cached graph)
    • Graph traversal logic
    • Fast generation of adjacency or incidence matrices
  • generic-SPMV in C++
    • Operator interface and implementation
    • Plugin to Pytorch
  • Improve end-to-end training time of tree-lstm to be 2x faster than DyNet.
  • Improve GAT-spmv implementation to be faster than TF.
  • Document & tutorial plans
    • Setup sphinx doc

Low priority
Tasks that anyone can contribute to when they have no high-priority task at the moment.

  • Setup CI
    • Basic CI (#30 )
    • GPU CI
  • Setup lint check
  • Scaling tests using generated graphs (#32 )
  • Scale to multi-card

Applications

  • GCN-like
    • GCN (#35 )
    • GAT (#37)
    • MGCN
    • relational GCN
    • GeniePath
    • PageRank
    • MPNN
    • GraphSAGE with different aggregators
  • TreeLSTM
  • LineGraph
  • Graph generation model
    • DGMG (#14 )
    • Junction Tree VAE
  • DeepWalk/node2vec
  • Transformer (?)
  • CapsuleNet (?)

No longer allowed to set new features for dgl.BatchedDGLGraph

In previous versions of dgl, it was possible to add a new field to a dgl.BatchedDGLGraph object. This is no longer allowed, which is a bit troublesome. For example, when performing batched readout, one may need to create auxiliary fields.

You can use the code below for testing:

import dgl
import torch

g1 = dgl.DGLGraph()
g1.add_nodes(2)
g2 = dgl.DGLGraph()
g2.add_nodes(1)
bg = dgl.batch([g1, g2])
bg.ndata['h'] = torch.randn(3, 1)

[USABILITY] Error in send_and_recv on partial nodes

It seems that send_and_recv does not properly process the indices of partial nodes.

import dgl
import torch as th

g = dgl.DGLGraph()
g.add_nodes(100)
g.add_edges([20, 21, 22, 23, 24], [25, 26, 27, 28, 29])
g.set_e_repr({'e': th.ones(5)})
g.set_n_repr({'h': th.ones(100)})


def msg(src, edge):
    return {'h': src['h']}


def reduce(node, msg):
    return {'sum': msg['h']}


def update(node):
    return {'h': node['sum']}

## Works Fine
g.update_all(msg, reduce, update)

## Error
g.send_and_recv([20, 21, 22, 23, 24], [25, 26, 27, 28, 29], msg, reduce, update)

(screenshot of the resulting error traceback omitted)

Tree-LSTM and msg/reduce/update functions

We met a problem when implementing tree-lstm with the new reduce function. The Child-Sum Tree-LSTM is described as follows:

(Figure: Child-Sum Tree-LSTM equations from Tai et al. 2015, reproduced in plain text.)

    h_tilda_j = sum_{k in C(j)} h_k                            (2)
    i_j = sigmoid(W_i x_j + U_i h_tilda_j + b_i)               (3)
    f_jk = sigmoid(W_f x_j + U_f h_k + b_f)                    (4)
    o_j = sigmoid(W_o x_j + U_o h_tilda_j + b_o)               (5)
    u_j = tanh(W_u x_j + U_u h_tilda_j + b_u)                  (6)
    c_j = i_j * u_j + sum_{k in C(j)} f_jk * c_k               (7)
    h_j = o_j * tanh(c_j)                                      (8)

where C(j) denotes the children of node j.

@GaiYu0 has implemented this when there were only msg and update functions (see code here). However, with the new reduce function, it is tricky. Recall that the msg/reduce/update function signatures are defined as follows:

def message_func(src, dst, edge):
   # return the message to be sent along the edge.
   ...

def reduce_func(msgs):
   # return the reduced msg
   ...

def update_func(node, msg_reduced):
   # return the new node state
   ...

Let's look at the equations. Equation (2) is a summation, so it clearly belongs in the reduce function. Equations (3), (5), and (6) rely on the node state of the receiver node (x_j) and the reduced hidden state (h_tilda), so they should go in the update function. Equations (4) and (7) are problematic. Equation (4) computes a "pair-wise" transformation using the node states of the receiver and sender, and the result is applied as the weight in the summation in equation (7). This computation paradigm is similar to the GAT model, but without the softmax. As a result, equations (4) and (7) should go in the reduce function, but the reduce function has no access to the receiver state (x_j). This means the user must also include the dst node state (x_j) in the message. @ylfdq1118 mentioned the same problem in the GAT model before. We therefore propose to change the msg/reduce/update signatures to the following:

def message_func(src, edge):
   # return the message to be sent along the edge.
   ...

def update_edge_func(src, dst, edge):
   # return the new edge state
   ...

def reduce_func(node, msgs):
   # return the reduced msg
   ...

def update_func(node, msg_reduced):
   # return the new node state
   ...

Change#1: The message function no longer has access to the dst node state. By contrast, the update edge function still has access to both end points.
Change#2: The reduce function now has access to the receiver node state.

This seems to be a step back, as the new reduce function signature is identical to the old update function signature before we decided to add reduce. Still, the reduce function is a very powerful tool, because the user can now fuse the message and update functions into a single reduce function:

def message_func(src, edge):
  # Dummy message function that simply returns all the inputs.
  return {'src' : src, 'edge': edge}
def reduce_func(node, msgs):
  # The reduce function now has access to both the dst node and all the src nodes.
  ...
def update_func(node, msg_reduced):
  # Dummy update function that simply returns the reduced msg.
  return msg_reduced

We think this change is necessary because the message/reduce/update functions have different batching approaches: the message function is batched over all edges; the reduce function is batched over nodes with the same degree; the update function is batched over all nodes. Since batching over all edges/nodes is more efficient than the degree-constrained batching of the reduce function, it is the user's responsibility to keep the reduce function as light as possible and put more logic in the message/update functions. The best case is when the reduce function is a built-in reducer (e.g. "sum"); the worst case is when all the logic is fused into the reduce function. For example, a good implementation of tree-lstm using the new signatures is given below. You can see how the equations are split across the different functions.

def message_func(src, edge):
  return {'h' : src['h'], 'c' : src['c']}
def reduce_func(node, msgs):
  # equation (2)
  h_tild = th.sum(msgs['h'], 1)
  # equation (4); msgs['h'] has shape (nodes, degree, dim), so use matmul
  wx = th.mm(node['x'], W_f).unsqueeze(1)
  uh = th.matmul(msgs['h'], U_f)
  f = th.sigmoid(wx + uh + b_f)
  # equation (7), second term
  c_tild = th.sum(f * msgs['c'], 1)
  return {'h_tild' : h_tild, 'c_tild' : c_tild}
def update_func(v, accum):
  # equations (3), (5), (6)
  iou = th.mm(v['x'], W_iou) + th.mm(accum['h_tild'], U_iou) + b_iou
  i, o, u = th.chunk(iou, 3, 1)
  i, o, u = th.sigmoid(i), th.sigmoid(o), th.tanh(u)
  # equation (7)
  c = i * u + accum['c_tild']
  # equation (8)
  h = o * th.tanh(c)
  return {'h' : h, 'c' : c}

I may just be re-raising the problem @ylfdq1118 found when implementing the GAT model.

[BUG?] Messages are not passed and reduced in parallel with send_and_recv

Old

In the older version of the API, with the following block of code, the messages are passed and reduced in parallel:

def gcn_msg(src, edge):
    return {'m': src['h']}

def gcn_reduce(node, msgs):
    return {'h': torch.sum(msgs['m'], 1)}

g = dgl.DGLGraph()
g.add_nodes(3)
g.add_edges([0, 1, 0], [1, 0, 2])
g.set_n_repr({'h': torch.tensor([[1.], [2.], [3.]])})
g.send_and_recv([0, 1, 0], [1, 0, 2], gcn_msg, gcn_reduce)

After send_and_recv,

g.get_n_repr()['h']

yields

tensor([[2.],
        [1.],
        [1.]])

New

With the current APIs, a different behavior is observed:

def gcn_msg(edges):
    return {'m' : edges.src['h']}

def gcn_reduce(nodes):
    return {'h' : torch.sum(nodes.mailbox['m'], 1)}

g = dgl.DGLGraph()
g.add_nodes(3)
g.add_edges([0, 1, 0], [1, 0, 2])
g.ndata['h'] = torch.tensor([[1.], [2.], [3.]])
g.send_and_recv([[0, 1, 0], [1, 0, 2]], gcn_msg, gcn_reduce)

Now after send_and_recv,

g.ndata['h']

yields

tensor([[2.],
        [1.],
        [3.]])

I tried replacing send_and_recv above with update_all, and the discrepancy is not observed. Any idea why this is the case?

About message and update function

I'm thinking about how to make the API cleaner. Basically, we only have two types of computations:

  • Computation on an edge. It takes the states of the src and dst nodes and the old edge states, and returns the new edge states. The signature can be summarized as:
    (node_states, node_states, edge_states) -> edge_states
  • Computation on a node. It takes the states of the incoming edges and the old node states, and computes the new node states. The signature can be summarized as:
    (list[edge_states], node_states) -> node_states

In networkx, all node and edge states are dictionaries (you can use type(g.nodes[node]) to verify this).

A message function can be viewed as a special edge computation as follows:

def edge_func(src_node_states, dst_node_states, edge_states):
  msg = ...  # compute message
  new_edge_states = edge_states
  new_edge_states['msg'] = msg
  return new_edge_states

Because this pattern is so common, we can let the user return only the msg and handle the edge-state update ourselves:

def mfunc(node_states, edge_states):
  # User-defined: compute the message only.
  msg = ...  # compute message
  return msg

def mfunc_wrapper(src_node_states, dst_node_states, edge_states):
  # System wrapper: store the message in the edge state.
  new_edge_states = edge_states
  new_edge_states['msg'] = mfunc(src_node_states, edge_states)
  return new_edge_states

Similarly, we can wrap the update function as a node function:

def ufunc(msgs, edge_states, node_states):
  # User-defined: compute the new node state from the messages.
  new_node_states = ...  # compute new node state
  return new_node_states

def ufunc_wrapper(edge_states, node_states):
  # System wrapper: gather the messages out of the edge states.
  msgs = [es['msg'] for es in edge_states]
  new_node_states = ufunc(msgs, edge_states, node_states)
  return new_node_states

I want to know your opinions about this implementation. Using this idea, we might have two basic APIs, register_edge_func and register_node_func, with register_message_func and register_update_func as thin wrappers over them (see the sketch below).
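
The sketch assumes register_edge_func / register_node_func with the wrapper conventions above; both names are part of the proposal, not an existing API.

def register_message_func(g, mfunc):
    # A message function is an edge computation that only fills the 'msg' slot.
    def edge_func(src_node_states, dst_node_states, edge_states):
        edge_states['msg'] = mfunc(src_node_states, edge_states)
        return edge_states
    g.register_edge_func(edge_func)

def register_update_func(g, ufunc):
    # An update function is a node computation fed with the gathered messages.
    def node_func(edge_states_list, node_states):
        msgs = [es['msg'] for es in edge_states_list]
        return ufunc(msgs, edge_states_list, node_states)
    g.register_node_func(node_func)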

[GraphIndex] Graph traversal

BFS/Topological traversal

Process nodes/edges belonging to the same layer (BFS) / frontier (topological traversal) in the same batch.

  • bfs(G[, source, depth_limit]) and topo(G) return a list of Tensors of layer/frontier nodes
  • bfs_edge(G[, source, depth_limit]) and topo_edge(G) return a list of Tensors of layer/frontier edges

where source is a collection of nodes that specifies the initial BFS layer and depth_limit specifies the maximum BFS depth.
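
As a reference for the intended return shape, here is a small runnable sketch that computes BFS layers with networkx and packs each layer into a Tensor; the helper name bfs_layers is made up.

import networkx as nx
import torch

def bfs_layers(G, source):
    # One Tensor per BFS layer, starting from the given source set.
    layers, seen, frontier = [], set(source), list(source)
    while frontier:
        layers.append(torch.tensor(frontier))
        nxt = {v for u in frontier for v in G.successors(u)} - seen
        seen |= nxt
        frontier = sorted(nxt)
    return layers

G = nx.DiGraph([(0, 1), (0, 2), (1, 3), (2, 3)])
print(bfs_layers(G, [0]))  # [tensor([0]), tensor([1, 2]), tensor([3])]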

DFS

Process nodes/edges in the same strongly connected component in sequential DFS order.
Nodes/edges belonging to different paths in the same strongly connected component cannot be processed in the same batch because of possible dependencies.
However, traversals of different strongly connected components can be processed in the same batch because there is no dependency between them.
For now we only consider the case where the strongly connected components are the component graphs batched in a BatchedDGLGraph.

NetworkX provides the following APIs for DFS:

dfs_edges(G[, source, depth_limit])
dfs_tree(G[, source, depth_limit])
dfs_predecessors(G[, source, depth_limit])
dfs_successors(G[, source, depth_limit])
dfs_preorder_nodes(G[, source, depth_limit])
dfs_postorder_nodes(G[, source, depth_limit])
dfs_labeled_edges(G[, source, depth_limit]) (used in JT-VAE example)

We can extend these APIs so that source can be a container of nodes (belonging to different graphs) and a call yields Tensors of nodes/edges instead of plain lists. A reference sketch follows.
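
For one of them, dfs_edges, here is a runnable sketch of the extended semantics, under the assumption that the sources live in disjoint components; the helper name batched_dfs_edges is made up.

import networkx as nx
import torch

def batched_dfs_edges(G, sources):
    # The i-th DFS edge of every source can go into the same batch,
    # since sources in different components have no dependencies.
    walks = [list(nx.dfs_edges(G, source=s)) for s in sources]
    for step in range(max(len(w) for w in walks)):
        yield torch.tensor([w[step] for w in walks if step < len(w)])

# Two disjoint paths: 0 -> 1 -> 2 and 3 -> 4.
G = nx.DiGraph([(0, 1), (1, 2), (3, 4)])
for edges in batched_dfs_edges(G, [0, 3]):
    print(edges)
# tensor([[0, 1], [3, 4]])
# tensor([[1, 2]])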
