GraphSAGE: Inductive Representation Learning on Large Graphs

Authors: William L. Hamilton, Rex Ying

Project Website

Overview

This directory contains code necessary to run the GraphSAGE algorithm. See our paper for details on the algorithm.

The example_data subdirectory contains a small example of the protein-protein interaction (PPI) data, which includes three training graphs, one validation graph, and one test graph. The full Reddit and PPI datasets (described in the paper) are available on the project website.

If you make use of this code or the GraphSAGE algorithm in your work, please cite the following paper:

 @article{hamilton2017inductive,
     author = {Hamilton, William L. and Ying, Rex and Leskovec, Jure},
     title = {Inductive Representation Learning on Large Graphs},
     journal = {arXiv preprint, arXiv:1706.02216},
     year = {2017}
 }

Requirements

Recent versions of TensorFlow, numpy, scipy, and networkx are required.

Running the code

The example_unsupervised.sh and example_supervised.sh files contain example usages of the code, which use the unsupervised and supervised variants of GraphSAGE, respectively. Note that example_unsupervised.sh sets a very small max iteration number, which can be increased to improve performance. We generally found that performance continued to improve even after the loss was very near convergence (i.e., even when the loss was decreasing at a very slow rate).

Note: For the PPI data, and any other multi-output dataset that allows individual nodes to belong to multiple classes, it is necessary to set the --sigmoid flag during supervised training. By default the model assumes that the dataset is in the "one-hot" categorical setting, where each node belongs to exactly one class.
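The difference between the two settings can be illustrated with a few lines of plain Python. This is only a conceptual sketch, not the repository's training code: the logits and the 0.5 threshold are made-up values, and the helper names are hypothetical.

```python
import math

def softmax(logits):
    """One-hot setting: classes compete and the probabilities sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(logits):
    """Multi-label setting: each class gets an independent probability."""
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]

# Hypothetical per-class scores for a single node.
logits = [2.0, 1.5, -3.0]

# One-hot: keep only the single most probable class.
probs = softmax(logits)
one_hot_prediction = max(range(len(probs)), key=lambda i: probs[i])

# Multi-label: keep every class whose probability clears a threshold.
# The 0.5 threshold is an illustrative choice.
multi_label_prediction = [i for i, p in enumerate(sigmoid(logits)) if p > 0.5]
```

With one-hot outputs a node that truly belongs to two classes can only ever be credited with one of them, which is why multi-label data such as PPI needs the independent per-class treatment.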

Input format

At minimum, the code requires a --train_prefix option, which identifies the following data files:

  • <train_prefix>-G.json -- A networkx-specified json file describing the input graph. Nodes have 'val' and 'test' attributes specifying if they are a part of the validation and test sets, respectively.
  • <train_prefix>-id_map.json -- A json-stored dictionary mapping the graph node ids to consecutive integers.
  • <train_prefix>-class_map.json -- A json-stored dictionary mapping the graph node ids to classes.
  • <train_prefix>-feats.npy -- A numpy-stored array of node features; ordering given by id_map.json.
  • <train_prefix>-walks.txt -- A text file specifying random walk co-occurrences, one pair per line (only needed for the unsupervised version of GraphSAGE).
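As a rough illustration of the JSON inputs, the sketch below writes a four-node toy dataset using only the standard library. Several things here are assumptions rather than facts about the code: the graph uses the node-link layout (`nodes` and `links` keys) that networkx's JSON helpers produce, the class file is named -class_map.json, the classes are stored as multi-label indicator lists, and `write_toy_inputs` is a hypothetical helper. The feature array (-feats.npy) is left out because it needs numpy.

```python
import json
import os
import tempfile

def write_toy_inputs(prefix):
    """Write toy -G.json, -id_map.json and -class_map.json files (hypothetical helper)."""
    # Node ids equal their list positions, so the links read the same whether
    # "source"/"target" are interpreted as node ids or as list indices.
    graph = {
        "directed": False,
        "multigraph": False,
        "graph": {},
        "nodes": [
            {"id": 0, "val": False, "test": False},  # training node
            {"id": 1, "val": False, "test": False},  # training node
            {"id": 2, "val": True, "test": False},   # validation node
            {"id": 3, "val": False, "test": True},   # test node
        ],
        "links": [
            {"source": 0, "target": 1},
            {"source": 1, "target": 2},
            {"source": 2, "target": 3},
        ],
    }
    # JSON object keys are always strings, so node ids are stringified.
    id_map = {str(node["id"]): row for row, node in enumerate(graph["nodes"])}
    # Assumed multi-label encoding: one 0/1 indicator per class.
    class_map = {"0": [1, 0], "1": [0, 1], "2": [1, 1], "3": [0, 0]}

    files = (("-G.json", graph), ("-id_map.json", id_map), ("-class_map.json", class_map))
    for suffix, payload in files:
        with open(prefix + suffix, "w") as handle:
            json.dump(payload, handle)
    return graph, id_map, class_map

# Write the files under a temporary directory with the prefix "toy".
prefix = os.path.join(tempfile.mkdtemp(), "toy")
write_toy_inputs(prefix)
```

The row index stored in the id map is what ties a node to its row in the feature array, so the two must be generated from the same node ordering.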

To run the model on a new dataset, you need to make data files in the format described above. To run random walks for the unsupervised model (and to generate the -walks.txt file), you can use the run_walks function in graphsage.utils.
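To give a sense of what a co-occurrence file contains, here is a minimal standalone sketch of random-walk pair generation. It is not the run_walks function itself: `random_walk_pairs`, `format_pairs`, the default walk counts and lengths, and the tab separator are all assumptions made for illustration.

```python
import random

def random_walk_pairs(adjacency, num_walks=2, walk_len=3, seed=0):
    """Pair each start node with the nodes visited on short random walks.

    `adjacency` maps a node id to a list of neighbor ids.
    """
    rng = random.Random(seed)
    pairs = []
    for start in adjacency:
        if not adjacency[start]:
            continue  # isolated nodes produce no pairs
        for _ in range(num_walks):
            current = start
            for _ in range(walk_len):
                neighbors = adjacency.get(current, [])
                if not neighbors:
                    break  # dead end (possible in a directed graph)
                current = rng.choice(neighbors)
                if current != start:  # skip self-pairs
                    pairs.append((start, current))
    return pairs

def format_pairs(pairs):
    """Render one pair per line; the tab separator is an assumption."""
    return "\n".join("{}\t{}".format(a, b) for a, b in pairs)

# A three-node path graph: 0 - 1 - 2.
toy_adjacency = {0: [1], 1: [0, 2], 2: [1]}
toy_pairs = random_walk_pairs(toy_adjacency)
walks_text = format_pairs(toy_pairs)
```

Nodes that appear together on a walk are treated as positive examples by the unsupervised objective, so longer or more numerous walks yield more training pairs at the cost of a larger file.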

Model variants

The user must also specify a --model, the variants of which are described in detail in the paper:

  • graphsage_mean -- GraphSAGE with mean-based aggregator
  • graphsage_seq -- GraphSAGE with LSTM-based aggregator
  • graphsage_pool -- GraphSAGE with max-pooling aggregator
  • gcn -- GraphSAGE with GCN-based aggregator
  • n2v -- an implementation of DeepWalk (called n2v for short in the code)
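The aggregators differ in how they combine the feature vectors of a node's sampled neighbors. The sketch below shows only the core reduction for the mean and max-pooling cases; it deliberately omits the learned weights, nonlinearities, and neighbor sampling that the real models use (the pooling aggregator in the paper, for instance, passes each neighbor through a learned layer before taking the max). The function names are hypothetical.

```python
def mean_aggregate(neighbor_feats):
    """Element-wise mean over a list of equal-length neighbor feature vectors."""
    count = len(neighbor_feats)
    return [sum(column) / count for column in zip(*neighbor_feats)]

def max_aggregate(neighbor_feats):
    """Element-wise max over a list of equal-length neighbor feature vectors."""
    return [max(column) for column in zip(*neighbor_feats)]

# Three hypothetical neighbors, each with a 2-dimensional feature vector.
neighbors = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]]

mean_vector = mean_aggregate(neighbors)  # [2.0, 2.0]
max_vector = max_aggregate(neighbors)    # [3.0, 4.0]
```

Both reductions are invariant to the order of the neighbors, whereas an LSTM-based aggregator consumes them as a sequence.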

Logging directory

Finally, a --base_log_dir should be specified (it defaults to the current directory). The model output and log files will be stored in a subdirectory of the base_log_dir. The path to the logged data will be of the form <sup/unsup>-<data_prefix>/graphsage-<model_description>/. The supervised model will output F1 scores, while the unsupervised model will train embeddings and store them. The unsupervised embeddings will be stored in a numpy-formatted file named val.npy, with val.txt specifying the order of the embeddings as a per-line list of node ids. Note that the full log outputs and stored embeddings can be 5-10 GB in size (on the full data when running with the unsupervised variant).
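Reading the stored embeddings back might look like the following sketch. It assumes that val.npy holds a 2-D array with one row per line of val.txt and that numpy is installed; `load_embeddings` is a hypothetical helper, not part of the code base.

```python
import os

import numpy as np

def load_embeddings(log_dir):
    """Return a dict mapping node id (as a string) to its embedding vector."""
    embeddings = np.load(os.path.join(log_dir, "val.npy"))
    with open(os.path.join(log_dir, "val.txt")) as handle:
        node_ids = [line.strip() for line in handle if line.strip()]
    # Row i of the array is assumed to belong to the node id on line i.
    if len(node_ids) != embeddings.shape[0]:
        raise ValueError(
            "val.txt lists {} ids but val.npy has {} rows".format(
                len(node_ids), embeddings.shape[0]
            )
        )
    return dict(zip(node_ids, embeddings))
```

Node ids are kept as strings because they are read from a text file; convert them if the graph uses integer ids. Calling `load_embeddings` on the model's output subdirectory then gives direct lookup by node id.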

Using the output of the unsupervised models

The unsupervised variants of GraphSAGE will output embeddings to the logging directory as described above. These embeddings can then be used in downstream machine learning applications. The eval_scripts directory contains examples of feeding the embeddings into simple logistic classifiers.

Acknowledgements

The original version of this code base was forked from https://github.com/tkipf/gcn/, and we owe many thanks to Thomas Kipf for making his code available. We also thank Yuanfang Li and Xin Li, who contributed to a course project that was based on this work. Please see the paper for funding details and additional (non-code related) acknowledgements.
