aravindsankar28 / dysat

Representation learning on dynamic graphs using self-attention networks

Python 55.32% Jupyter Notebook 43.92% TSQL 0.76%
dynamic-graphs self-attention graph-neural-network graph-embedding

dysat's Introduction

DySAT: Deep Neural Representation Learning on Dynamic Graphs via Self-Attention Networks.

Aravind Sankar, Yanhong Wu, Liang Gou, Wei Zhang, and Hao Yang, "DySAT: Deep Neural Representation Learning on Dynamic Graphs via Self-Attention Networks", International Conference on Web Search and Data Mining, WSDM 2020, Houston, TX, February 3-7, 2020.

This repository contains a TensorFlow implementation of DySAT (Dynamic Self-Attention networks) for dynamic graph representation learning. DySAT is an unsupervised graph embedding model that learns node embeddings in dynamic, time-evolving attributed graphs, which may later be used for downstream tasks such as link prediction, clustering, and node classification.

Note: Though DySAT is designed for attributed dynamic graphs, our benchmarking experiments are carried out on datasets that do not have node attributes.

DySAT: Dynamic Self-Attention Network

Incremental Dynamic Graph Embedding

To support streaming graph applications, we also provide an implementation of Incremental Self-Attention (IncSAT) networks that learn dynamic node embeddings incrementally in a stage-wise fashion. See our extended arXiv version for details on the algorithm.

If you make use of this code or the DySAT algorithm in your work, please cite our papers:

@article{sankar2018dynamic,
  title={Dynamic Graph Representation Learning via Self-Attention Networks},
  author={Sankar, Aravind and Wu, Yanhong and Gou, Liang and Zhang, Wei and Yang, Hao},
  journal={arXiv preprint arXiv:1812.09430},
  year={2018}
}

@inproceedings{sankar2020dysat,
  title={DySAT: Deep Neural Representation Learning on Dynamic Graphs via Self-Attention Networks},
  author={Sankar, Aravind and Wu, Yanhong and Gou, Liang and Zhang, Wei and Yang, Hao},
  booktitle={Proceedings of the 13th International Conference on Web Search and Data Mining},
  pages={519--527},
  year={2020}
}

Requirements:

TensorFlow (<= 1.14), networkx (<= 1.11), and recent versions of numpy, scipy, and scikit-learn are required. The code has been tested under Python 2.7. The required packages can be installed using the following command:

$ pip install -r requirements.txt

To guarantee that you have the right package versions, you can use Anaconda to set up a virtual environment and install the dependencies from requirements.txt.
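
For example, a minimal Anaconda setup might look like the following (the environment name dysat is illustrative, and older conda releases use source activate instead of conda activate):

$ conda create -n dysat python=2.7
$ conda activate dysat
$ pip install -r requirements.txt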

Input Format

In order to use your own data, you have to provide:

  • graphs: a list of networkx graphs (or multigraphs), one per time step, saved as .npz files. Have a look at the load_graphs() and load_feats() functions in utils/preprocess.py for an example.

  • features: a list of N x D feature matrices (N is the number of nodes and D is the number of features per node) in scipy sparse format -- optional.
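
As a rough sketch of preparing your own inputs in these formats (the paths and the feats key name below are illustrative assumptions; the graph key matches the convention used in the provided pre-processing notebooks, but check load_graphs() and load_feats() in utils/preprocess.py for the exact names expected for your dataset):

import networkx as nx
import numpy as np
import scipy.sparse as sp

# One networkx (multi)graph per time step.
graphs = [nx.MultiGraph() for _ in range(3)]
graphs[0].add_edge(0, 1)
graphs[1].add_edges_from([(0, 1), (1, 2)])
graphs[2].add_edges_from([(1, 2), (2, 0)])
np.savez('data/my_dataset/graphs.npz', graph=graphs)  # path is illustrative

# Optional: one N x D scipy sparse feature matrix per time step.
num_nodes = 3
feats = [sp.identity(num_nodes).tocsr() for _ in graphs]
np.savez('data/my_dataset/features.npz', feats=feats)  # key/path are illustrative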

Repository Organization

  • data/ contains the necessary input file(s) for each dataset after pre-processing.
  • raw_data/ contains the data pre-processing Jupyter notebooks for reference.
  • models/ contains the implementation of the two models: DySAT and IncSAT.
  • utils/ contains:
    • preprocessing subroutines (preprocess.py, utilities.py, random_walk.py);
    • minibatch iterators (minibatch.py, incremental_minibatch.py);
  • eval/ contains evaluation scripts that use simple logistic regression classifiers for link prediction based on the learnt node embeddings.
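
As a rough illustration of the kind of evaluation the eval/ scripts perform, the sketch below trains a logistic regression classifier on link features built from the learnt embeddings; the Hadamard-product link feature and all names here are illustrative assumptions rather than the exact choices made in the provided scripts:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def link_features(emb, pairs):
    # Hadamard product of the endpoint embeddings -- one common choice of link feature.
    return np.array([emb[u] * emb[v] for (u, v) in pairs])

def evaluate_link_prediction(emb, train_pos, train_neg, test_pos, test_neg):
    # emb: (N, F) array of node embeddings; *_pos / *_neg: lists of (u, v) node pairs.
    X_train = link_features(emb, train_pos + train_neg)
    y_train = np.array([1] * len(train_pos) + [0] * len(train_neg))
    X_test = link_features(emb, test_pos + test_neg)
    y_test = np.array([1] * len(test_pos) + [0] * len(test_neg))
    clf = LogisticRegression().fit(X_train, y_train)
    return roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])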

The pre-processed versions of all datasets are available here.

Running the code

The code can be run by executing python run_script.py. Default values for all parameters are set in the script and can be overridden via command-line arguments. The most important arguments are min_time and max_time, which specify the range of time steps over which the model is trained. This script calls multiple instances of train.py (or train_incremental.py) with time steps in this range (both ends included).

For example, if min_time is 2 and max_time is 3, two instances of the model are trained: the first trains on G1, while the second trains on G1 and G2. For link prediction, evaluation is performed on the links of G2 for the first instance and on the links of G3 for the second.
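
A typical invocation for this example might be (assuming the usual --flag value syntax of the script's argument parser):

python run_script.py --min_time 2 --max_time 3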

The other hyper-parameters of the model are specified in run_script.py (along with detailed descriptions) and may need to be appropriately tuned for different datasets.

Logging Directory

For logging, the model flag should be provided to specify the variant/version of the model being experimented with (initially set to default), in addition to choosing base_model as DySAT or IncSAT.

A logging directory log_dir is then created at ./logs/<base_model>_<model>/, overwriting any existing files that might conflict.

The output of the model, log files and evaluation results (on link prediction) will be stored in subdirectories of log_dir, with date-wise logged files, along with the set of hyper-parameters and settings used in the experiment.

The learnt embeddings will be stored in numpy format under the output/ subdirectory, and the results of downstream evaluation tasks will be stored in the csv/ subdirectory, both within log_dir.
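
For instance, with base_model set to DySAT and the default model flag, the resulting layout would look roughly as follows (exact file names are illustrative):

./logs/DySAT_default/
    output/    # learnt node embeddings in numpy format
    csv/       # downstream (link prediction) evaluation results
    ...        # date-wise log files and the hyper-parameters/settings used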


dysat's Issues

DySAT w/o temporal attention is better on Enron

Thanks for your contribution. I find that DySAT w/o temporal attention performs better on Enron.

I changed the code to remove the temporal attention part in models.py:

# 5: Temporal Attention forward
        temporal_inputs = structural_outputs
        outputs = temporal_inputs
        # for temporal_layer in self.temporal_attention_layers:
        #     outputs = temporal_layer(temporal_inputs)  # [N, T, F]
        #     temporal_inputs = outputs
        #     self.attn_wts_all.append(temporal_layer.attn_wts_all)
        return outputs

And run
python train.py --dataset Enron_new --time_steps 16

get results w/o temporal attention :

default results (val) [0.8700378071833649, 0.8700378071833649]
default results (val) [0.8851606805293006, 0.8851606805293006]
default results (test) [0.90290357641944, 0.90290357641944]
default results (test) [0.9008850473577972, 0.9008850473577972]

while DySat with temporal attention has results

default results (val) [0.9040642722117203, 0.9040642722117203]                                                                                                                          
default results (val) [0.9064272211720227, 0.9064272211720227]                                                                                                                          
default results (test) [0.8758863412866829, 0.8758863412866829]                                                                                                                         
default results (test) [0.8781636561254593, 0.8781636561254593]

It seems that DySAT without temporal attention performs better than the one with temporal attention. Is there something wrong?

Two errors when running run_script.py

Sorry to bother you. I found the following errors when running run_script.py: one is "Unresolved reference 'raw_input'", and the other comes from the "from layers import *" statement in the first line of model.py: "ModuleNotFoundError: No module named 'layers'". How can I solve them?
Thanks very much for your reply.

code issue

When I run train.py, I get the following problem:
absl.flags._exceptions.DuplicateFlagError: The flag 'log_dir' is defined twice. First from absl.logging, Second from flags. Description from first occurrence: directory to write logfiles into

Only increasing the learning rate from 1e-3 to 1e-2 makes AUC significantly better than the reported results, why?

Hi,
Thanks for your contribution! I cloned your code (python3 version) and trained the model on the Enron_new dataset you provided. With the learning rate set to 0.001 (the default value), the final AUC is around 0.86 (as you reported). However, when I increased the learning rate to 0.01 and regenerated the context pairs and eval data, the final AUC reached 0.89. I have tested this several times in the same environment as requirements.txt.
Did you try to tune the learning rate when training on this dataset? If not, this is a free and significant improvement, right?

How to change the embedding size?

I have a question about how to change the embedding size. I changed the temporal_layer_config and structural_layer_config from 128 to 16 and the code doesn't work. If I only change the temporal_layer_config to 16, the resulting embedding size is still 128.

The weight of the edge

First of all, thank you for your work. The paper mentions that undirected weighted graphs can be handled, but how edge weights are processed is not reflected in the code, and the data provided with the code also has no weights. So how do you deal with weighted graphs?

ModuleNotFoundError: No module named 'layers'

Hello,

When I run "run_script.py", I got the following error. I really don know how to solve it.

I created a virtual environment and installed all the modules from requirements.txt.

The code seems not to be the same as in the paper

In preprocess.py, context pairs are generated for time steps 1 to T; does the model then learn from 1 to T to predict links at time T?

def get_context_pairs(graphs, num_time_steps):
    """ Load/generate context pairs for each snapshot through random walk sampling."""
    load_path = "data/{}/train_pairs_n2v_{}.pkl".format(FLAGS.dataset, str(num_time_steps - 2))
    try:
        context_pairs_train = dill.load(open(load_path, 'rb'))
        print("Loaded context pairs from pkl file directly")
    except (IOError, EOFError):
        print("Computing training pairs ...")
        context_pairs_train = []
        for i in range(0, num_time_steps):
            context_pairs_train.append(run_random_walks_n2v(graphs[i], graphs[i].nodes()))
        dill.dump(context_pairs_train, open(load_path, 'wb'))
        print ("Saved pairs")

    return context_pairs_train

Maybe it should be for i in range(0, num_time_steps - 1)?

Unable to reproduce results on ML-10M

Thank you so much for sharing the open source code.
I downloaded the ml-10m dataset you provided and tried to use Enron's parameter settings for experiments on it, but found that the results of the algorithm are lower than those given in the paper.
Can you provide your parameters for the ml-10m dataset?
Thanks again for your work and time.
I wish you all the best!

How to build the graph snapshots?

I can only see that you load graph snapshot .npz files in your code, but I want to know how to build them from your raw dataset.
I want to build graph snapshots from my own dataset, so I need your help. Thanks!

dataset

Thank you for reading! The work is very interesting and attractive to me.
I am wondering if you could kindly send me the source dataset. I promise it will be used only for research purposes.
No matter whether you agree or not, best wishes to you. I would appreciate it if you could help me.
Thanks and regards.

No module named 'networkx.classes.reportviews'

I want to run your code on my own data. My data is of the form (u, v, t); I followed preprocess.ipynb from Enron to create my graphs, with float timestamps rather than dates and (u, v) as indices. But I get the error "No module named 'networkx.classes.reportviews'" when I use load_graphs to load my graphs.

slice_id = 0
for (a, b, t) in new_links:
    prev_slice_id = slice_id
    slice_id = time_dict[t]
    datetime_object = t
    if slice_id == 1 + prev_slice_id and slice_id > 0:
        slices_links[slice_id].add_nodes_from(slices_links[slice_id - 1].nodes(data=True))
    if a not in slices_links[slice_id]:
        slices_links[slice_id].add_node(a)
    if b not in slices_links[slice_id]:
        slices_links[slice_id].add_node(b)
    slices_links[slice_id].add_edge(a, b, date=datetime_object)

def new_remap(slices_graph):
    slices_graph_remap = []
    for slice_id in slices_graph:
        G = nx.MultiGraph()
        for x in slices_graph[slice_id].nodes():
            G.add_node(x)
        for x in slices_graph[slice_id].edges(data=True):
            G.add_edge(x[0], x[1], date=x[2]['date'])
        assert (len(G.nodes()) == len(slices_graph[slice_id].nodes()))
        assert (len(G.edges()) == len(slices_graph[slice_id].edges()))
        slices_graph_remap.append(G)
    return slices_graph_remap

slices_links_remap = new_remap(slices_links)
np.savez('graphs.npz', graph=slices_links_remap)

Did anyone have the same error?
