
pytorch-biggraph's Introduction

PyTorch-BigGraph


PyTorch-BigGraph (PBG) is a distributed system for learning graph embeddings for large graphs, particularly big web interaction graphs with up to billions of entities and trillions of edges.

PBG was introduced in the PyTorch-BigGraph: A Large-scale Graph Embedding Framework paper, presented at the SysML conference in 2019.

Update: PBG now supports GPU training. Check out the GPU Training section below!

Overview

PBG trains on an input graph by ingesting its list of edges, each identified by its source and target entities and, possibly, a relation type. It outputs a feature vector (embedding) for each entity, trying to place adjacent entities close to each other in the vector space, while pushing unconnected entities apart. Therefore, entities that have a similar distribution of neighbors will end up being nearby.

It is possible to configure each relation type to calculate this "proximity score" in a different way, with the parameters (if any) learned during training. This allows the same underlying entity embeddings to be shared among multiple relation types.

The generality and extensibility of its model allows PBG to train a number of models from the knowledge graph embedding literature, including TransE, RESCAL, DistMult and ComplEx.
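For intuition, here is a minimal NumPy sketch (not PBG's actual code) of one way such a proximity score can be computed: a per-relation translation operator applied to the right-hand-side embedding, followed by a dot-product comparator. All values are made up for illustration.

import numpy as np

lhs_embedding = np.array([0.1, -0.3, 0.2, 0.5])   # source entity (dimension 4 for brevity)
rhs_embedding = np.array([0.0, -0.2, 0.4, 0.6])   # target entity
translation = np.array([0.1, -0.1, -0.2, -0.1])   # learned per-relation parameter

# The operator transforms one side's embedding; the comparator turns the pair into a score.
score = np.dot(lhs_embedding, rhs_embedding + translation)
print(score)  # a higher score means the two entities are predicted to be connected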

PBG is designed with scale in mind, and achieves it through:

  • graph partitioning, so that the model does not have to be fully loaded into memory
  • multi-threaded computation on each machine
  • distributed execution across multiple machines (optional), all simultaneously operating on disjoint parts of the graph
  • batched negative sampling, allowing for processing >1 million edges/sec/machine with 100 negatives per edge

PBG is not optimized for small graphs. If your graph has fewer than 100,000 nodes, consider using KBC with the ComplEx model and N3 regularizer. KBC produces state-of-the-art embeddings for graphs that can fit on a single GPU. Compared to KBC, PyTorch-BigGraph enables learning on very large graphs whose embeddings wouldn't fit in a single GPU or a single machine, but may not produce high-quality embeddings for small graphs without careful tuning.

Requirements

PBG is written in Python (version 3.6 or later) and relies on PyTorch (at least version 1.0) and a few other libraries.

By default, all computations are performed on the CPU, so a large number of cores is advisable; no GPU is necessary.

When running on multiple machines, they need to be able to communicate with each other at high bandwidth (10 Gbps or higher recommended) and have access to a shared filesystem (for checkpointing). PBG uses torch.distributed, which uses the Gloo package running on top of TCP or MPI.

Installation

Clone the repository (or download it as an archive) and, inside the top-level directory, run:

pip install .

PyTorch-BigGraph includes some C++ kernels that are only used for the experimental GPU mode. If you want to use GPU mode, compile the C++ code as follows:

PBG_INSTALL_CPP=1 pip install .

Everything will work identically except that you will be able to run GPU training (torchbiggraph_train_gpu).

The results of the paper can easily be reproduced by running the following command (which executes this script):

torchbiggraph_example_fb15k

This will download the Freebase 15k knowledge base dataset, put it into the right format, train on it using the ComplEx model and finally perform an evaluation of the learned embeddings that calculates the MRR and other metrics that should match the paper. Another command, torchbiggraph_example_livejournal, does the same for the LiveJournal interaction graph dataset.

To learn how to use PBG, let us walk through what the FB15k script does.

Getting started

Downloading the data

First, it retrieves the dataset and unpacks it, obtaining a directory with three edge sets as TSV files, for training, validation and testing.

wget https://dl.fbaipublicfiles.com/starspace/fb15k.tgz -P data
tar xf data/fb15k.tgz -C data

Each line of these files contains information about one edge. Using tabs as separators, the lines are divided into columns which contain the identifiers of the source entities, the relation types and the target entities. For example:

/m/027rn	/location/country/form_of_government	/m/06cx9
/m/017dcd	/tv/tv_program/regular_cast./tv/regular_tv_appearance/actor	/m/06v8s0
/m/07s9rl0	/media_common/netflix_genre/titles	/m/0170z3
/m/01sl1q	/award/award_winner/awards_won./award/award_honor/award_winner	/m/044mz_
/m/0cnk2q	/soccer/football_team/current_roster./sports/sports_team_roster/position	/m/02nzb8

Preparing the data

Then, the script converts the edge lists to PBG's input format. This amounts to assigning a numerical identifier to all entities and relation types, shuffling and partitioning the entities and edges, and writing everything out in the right format.

Luckily, there is a command that does all of this:

torchbiggraph_import_from_tsv \
  --lhs-col=0 --rel-col=1 --rhs-col=2 \
  torchbiggraph/examples/configs/fb15k_config_cpu.py \
  data/FB15k/freebase_mtr100_mte100-train.txt \
  data/FB15k/freebase_mtr100_mte100-valid.txt \
  data/FB15k/freebase_mtr100_mte100-test.txt

The outputs will be stored next to the inputs in the data/FB15k directory.

This simple utility is only suitable for small graphs that fit entirely in memory. To handle larger data you will have to implement your own custom preprocessor; a rough sketch of the kind of work involved follows.
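As a hedged illustration of that work, the sketch below streams a TSV edge list once and assigns contiguous numerical identifiers to entities and relation types. It deliberately stops short of producing PBG's partitioned output files, and the path in the usage comment is just the FB15k training file from above.

import csv

def build_dictionaries(tsv_path, lhs_col=0, rel_col=1, rhs_col=2):
    # Stream the edge list once, assigning a contiguous ID to every entity and
    # relation type. This is only a small part of what torchbiggraph_import_from_tsv
    # does; it does NOT write PBG's partitioned output files.
    entity_ids, relation_ids = {}, {}
    with open(tsv_path, newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            entity_ids.setdefault(row[lhs_col], len(entity_ids))
            entity_ids.setdefault(row[rhs_col], len(entity_ids))
            relation_ids.setdefault(row[rel_col], len(relation_ids))
    return entity_ids, relation_ids

# entities, relations = build_dictionaries("data/FB15k/freebase_mtr100_mte100-train.txt")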

Training

The torchbiggraph_train command is used to launch training. The training parameters are tucked away in a configuration file, whose path is given to the command. They can however be overridden from the command line with the --param flag. The sample config is used for both training and evaluation, so we will have to use the override to specify the edge set to use.

torchbiggraph_train \
  torchbiggraph/examples/configs/fb15k_config_cpu.py \
  -p edge_paths=data/FB15k/freebase_mtr100_mte100-train_partitioned

This will read data from the entity_path directory specified in the configuration and the edge_paths directory given on the command line. It will write checkpoints (which also double as the output data) to the checkpoint_path directory also defined in the configuration, which in this case is model/fb15k.
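For reference, PBG configuration files are Python modules exposing a get_torchbiggraph_config function that returns a dictionary (a full example appears in the issues section of this page). The following is only a hedged, abridged approximation of an FB15k-style config; consult the file shipped with the repository for the actual values.

def get_torchbiggraph_config():
    # Illustrative approximation only; the real fb15k_config_cpu.py ships with
    # the repository and may use different values.
    return dict(
        entity_path="data/FB15k",
        edge_paths=[],                    # typically overridden with -p edge_paths=...
        checkpoint_path="model/fb15k",    # checkpoints double as the output embeddings
        entities={"all": {"num_partitions": 1}},
        relations=[{"name": "all_edges", "lhs": "all", "rhs": "all",
                    "operator": "complex_diagonal"}],
        dynamic_relations=True,
        num_epochs=50,
        dimension=400,
        comparator="dot",
        num_uniform_negs=1000,
    )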

Training will proceed for 50 epochs in total, with the progress and some statistics logged to the console, for example:

Starting epoch 1 / 50, edge path 1 / 1, edge chunk 1 / 1
Edge path: data/FB15k/freebase_mtr100_mte100-train_partitioned
still in queue: 0
Swapping partitioned embeddings None ( 0 , 0 )
( 0 , 0 ): Loading entities
( 0 , 0 ): bucket 1 / 1 : Processed 483142 edges in 17.36 s ( 0.028 M/sec ); io: 0.02 s ( 542.52 MB/sec )
( 0 , 0 ): loss:  309.695 , violators_lhs:  171.846 , violators_rhs:  165.525 , count:  483142
Swapping partitioned embeddings ( 0 , 0 ) None
Writing partitioned embeddings
Finished epoch 1 / 50, edge path 1 / 1, edge chunk 1 / 1
Writing the metadata
Writing the checkpoint
Switching to the new checkpoint version

GPU Training

Warning: GPU Training is still experimental; expect sharp corners and lack of documentation.

torchbiggraph_example_fb15k will automatically detect if a GPU is available and run with the GPU training config. For your own training runs, you will need to change a few parameters to enable GPU training. Let's see how the two FB15k configs differ:

$ diff torchbiggraph/examples/configs/fb15k_config_cpu.py torchbiggraph/examples/configs/fb15k_config_gpu.py
37a38
>         batch_size=10000,
42a44,45
>         # GPU
>         num_gpus=1,

The most important difference is of course num_gpus=1, which says to run on 1 GPU. If num_gpus=N>1, PBG will recursively shard the embeddings within each partition into N subpartitions to run on multiple GPUs. The subpartitions need to fit in GPU memory, so if you get CUDA out-of-memory errors, you'll need to increase num_partitions or num_gpus.

The next most important difference for GPU training is that batch_size must be much larger. Since training is performed on a single GPU instead of 40 cores, the batch size can be increased by about that factor as well. We suggest a batch size of around 100,000 in order to achieve good speeds for GPU training.

Since evaluation still occurs on CPU, we suggest turning down eval_fraction to at most 0.01 so that evaluation does not become a bottleneck (not relevant for FB15k which doesn't do eval during training).

Finally, to take advantage of GPU speed, we suggest turning up num_uniform_negatives and/or num_batch_negatives to about 1000 rather than their default values of 50 (FB15k already uses 1000 uniform negatives).
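Putting these recommendations together, here is a hedged sketch of the overrides one might apply on top of a CPU config dictionary. The CPU-side placeholder values and the exact spelling of the negative-sampling parameter (num_uniform_negs is the spelling used elsewhere in this document) are assumptions to check against your actual config.

# Hedged sketch: GPU-oriented overrides applied to a CPU config dict.
cpu_config = dict(num_epochs=50, batch_size=1000, eval_fraction=0.05,
                  num_uniform_negs=50)  # placeholder values, not the shipped config

gpu_config = dict(
    cpu_config,
    num_gpus=1,             # shard each partition's embeddings across this many GPUs
    batch_size=100000,      # much larger batches than on CPU, per the guidance above
    eval_fraction=0.01,     # evaluation still runs on CPU; keep it from becoming a bottleneck
    num_uniform_negs=1000,  # more negatives per edge to exploit GPU throughput
)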

Evaluation

Once training is complete, the entity embeddings it produced can be evaluated against a held-out edge set. The torchbiggraph_example_fb15k command performs a filtered evaluation, which calculates the ranks of the edges in the evaluation set by comparing them against all other edges except the ones that are true positives in any of the training, validation or test set. Filtered evaluation is used in the literature for FB15k, but does not scale beyond small graphs.
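To make the raw vs. filtered distinction concrete, here is a small self-contained sketch of ranking one test edge with toy scores; it only illustrates the metric, not PBG's evaluation code.

import numpy as np

true_score = 7.0                                   # score of the true target entity
candidate_scores = np.array([9.1, 8.2, 7.5, 6.0])  # scores of candidate replacement entities
known_positives = {0, 2}                           # candidates that are themselves true edges

raw_rank = 1 + int((candidate_scores > true_score).sum())
filtered = np.array([s for i, s in enumerate(candidate_scores) if i not in known_positives])
filtered_rank = 1 + int((filtered > true_score).sum())

print(raw_rank, filtered_rank)          # 4 2: filtering stops other true edges from outranking the test edge
print(1 / raw_rank, 1 / filtered_rank)  # the reciprocal ranks that MRR averages over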

The final results should match the values of mrr (Mean Reciprocal Rank, MRR) and r10 (Hits@10) reported in the paper:

Stats: pos_rank:  65.4821 , mrr:  0.789921 , r1:  0.738501 , r10:  0.876894 , r50:  0.92647 , auc:  0.989868 , count:  59071

Evaluation can also be run directly from the command line as follows:

torchbiggraph_eval \
  torchbiggraph/examples/configs/fb15k_config_cpu.py \
  -p edge_paths=data/FB15k/freebase_mtr100_mte100-test_partitioned \
  -p relations.0.all_negs=true \
  -p num_uniform_negs=0

However, filtered evaluation cannot be performed on the command line, so the reported results will not match the paper. They will be something like:

Stats: pos_rank:  234.136 , mrr:  0.239957 , r1:  0.131757 , r10:  0.485382 , r50:  0.712693 , auc:  0.989648 , count:  59071

Converting the output

During preprocessing, the entities and relation types had their identifiers converted from strings to ordinals. In order to map the output embeddings back onto the original names, one can do:

torchbiggraph_export_to_tsv \
  torchbiggraph/examples/configs/fb15k_config.py \
  --entities-output entity_embeddings.tsv \
  --relation-types-output relation_types_parameters.tsv

This will create the entity_embeddings.tsv file, a text file where each line contains the identifier of an entity followed by the components of its embedding, one per column, all separated by tabs. For example, with each line shortened for brevity:

/m/0fphf3v	-0.524391472	-0.016430536	-0.461346656	-0.394277513	0.125605106	...
/m/01bns_	-0.122734159	-0.091636233	0.506501377	-0.503864646	0.215775326	...
/m/02ryvsw	-0.107151665	0.002058491	-0.094485454	-0.129078045	-0.123694092	...
/m/04y6_qr	-0.577532947	-0.215747222	-0.022358289	-0.352154016	-0.051905245	...
/m/02wrhj	-0.593656778	-0.557167351	0.042525314	-0.104738958	-0.265990764	...
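A minimal sketch of how one could load this file back into memory (plain TSV parsing, not a PBG API):

import numpy as np

def load_entity_embeddings(path="entity_embeddings.tsv"):
    # Map each entity identifier to its embedding vector.
    embeddings = {}
    with open(path) as f:
        for line in f:
            name, *values = line.rstrip("\n").split("\t")
            embeddings[name] = np.asarray(values, dtype=np.float32)
    return embeddings

# embeddings = load_entity_embeddings()
# print(embeddings["/m/0fphf3v"][:5])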

It will also create a relation_types_parameters.tsv file which contains the parameters of the operators for the relation types. The format is similar to the above, but each line starts with more key columns containing, respectively, the name of a relation type, a side (lhs or rhs), the name of the operator which is used by that relation type on that side, the name of a parameter of that operator and the shape of the parameter (integers separated by x). These columns are followed by the values of the flattened parameter. For example, for two relation types, foo and bar, respectively using operators linear and complex_diagonal, with an embedding dimension of 200 and dynamic relations enabled, this file could look like:

foo	lhs	linear	linear_transformation	200x200	-0.683401227	0.209822774	-0.047136042	...
foo	rhs	linear	linear_transformation	200x200	-0.695254087	0.502532542	-0.131654695	...
bar	lhs	complex_diagonal	real	200	0.263731539	1.350529909	1.217602968	...
bar	lhs	complex_diagonal	imag	200	-0.089371338	-0.092713356	0.025076168	...
bar	rhs	complex_diagonal	real	200	-2.350617170	0.529571176	0.521403074	...
bar	rhs	complex_diagonal	imag	200	0.692483306	0.446569800	0.235914066	...
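Similarly, a hedged sketch of parsing these relation-parameter rows back into arrays (the field order follows the description above; this is plain TSV handling, not a PBG API):

import numpy as np

def load_relation_parameters(path="relation_types_parameters.tsv"):
    # Map (relation, side, operator, parameter) to the reshaped parameter array.
    params = {}
    with open(path) as f:
        for line in f:
            relation, side, operator, parameter, shape, *values = line.rstrip("\n").split("\t")
            dims = tuple(int(d) for d in shape.split("x"))
            params[(relation, side, operator, parameter)] = np.asarray(values, dtype=np.float32).reshape(dims)
    return params

# params = load_relation_parameters()
# params[("bar", "lhs", "complex_diagonal", "real")].shape  # -> (200,)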

Pre-trained embeddings

We trained a PBG model on the full Wikidata graph, using a translation operator to represent relations. It can be downloaded here (36GiB, gzip-compressed). We used the truthy version of data from here to train our model. The model file is in TSV format as described in the above section. Note that the first line of the file contains the number of entities, the number of relations and the dimension of the embeddings, separated by tabs. The model contains 78 million entities, 4,131 relations and the dimension of the embeddings is 200.
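Given the file size, it is best read in a streaming fashion. Here is a hedged sketch using plain gzip/TSV handling (no PBG API); it assumes the downloaded archive is named wikidata_translation_v1.tsv.gz, the name referenced later on this page.

import gzip
import numpy as np

# Assumes the downloaded file is named wikidata_translation_v1.tsv.gz; adjust as needed.
with gzip.open("wikidata_translation_v1.tsv.gz", "rt") as f:
    num_entities, num_relations, dim = (int(x) for x in next(f).rstrip("\n").split("\t"))
    print(num_entities, num_relations, dim)  # roughly 78M entities, 4131 relations, 200 dimensions

    # Stream a few embedding rows without loading the whole (100+ GB) file into memory.
    for _ in range(3):
        name, *values = next(f).rstrip("\n").split("\t")
        print(name, np.asarray(values, dtype=np.float32)[:5])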

Documentation

More information can be found in the full documentation.

Communication

  • GitHub Issues: Bug reports, feature requests, install issues, etc.
  • The PyTorch-BigGraph Slack is a forum for online discussion between developers and users, discussing features, collaboration, etc.

Citation

To cite this work please use:

@inproceedings{pbg,
  title={{PyTorch-BigGraph: A Large-scale Graph Embedding System}},
  author={Lerer, Adam and Wu, Ledell and Shen, Jiajun and Lacroix, Timothee and Wehrstedt, Luca and Bose, Abhijit and Peysakhovich, Alex},
  booktitle={Proceedings of the 2nd SysML Conference},
  year={2019},
  address={Palo Alto, CA, USA}
}

License

PyTorch-BigGraph is BSD licensed, as found in the LICENSE.txt file.


pytorch-biggraph's Issues

Unbalanced entity category & too small embedding parameters

Hello,
First of all, thank you for this amazing project.

I have used BigGraph to train embedding vectors, but the result was not satisfactory, so I am asking for your advice.

The graph I used has 2 entity categories, say, items and users, but their counts are quite different: there are 700,000 items and 10,000,000 users. The graph has 70,000,000 edges and only 1 relation type, a user clicked an item. The vector dimension is set to 128 and I trained for 30 epochs.

The final embedding vectors also confused me a lot, as their values are very small, like this:
\46XC28261954309 0.000960 -0.002464 0.001338 -0.000077 0.001412 -0.000633 0.002548 -0.000783 0.001233......

So, I want to ask two questions:

  1. Does the unbalanced entity category influence BigGraph's results? In other words, can this kind of graph be properly trained to get good embedding vectors?
  2. Why do the vectors I got have such small values? I am sure I did not change BigGraph's model configuration. I also want to know whether these small values influence the quality of the embedding vectors.

Looking forward to your reply,
Thanks!

How relation parameters are updated

In the paper, relation parameters are updated via the parameter server, but reading the code I didn't find where the client interacts with the server after

    # start communicating shared parameters with the parameter server
    if parameter_sharer is not None:
        parameter_sharer.share_model_params(model)

in train.py, line 500.

I wonder where the update is performed. Thank you.

How to use edge weights?

Steps to reproduce

  1. Tried to input the edge weights along with the positive edge tsv file
  2. TSV had edge weights in the 4th column
  3. Any explanation on how to use edge weights would be helpful

Observed Results

Error While converting input data

Relevant Code

convert_input_data(args.config, edge_paths, lhs_col=0, rhs_col=2, rel_col=1, weight=3)

[Question] How to configure my model if I split an entity type

Hello, I'd like to ask: if I split one entity type, say TypeA, into 2 parts, can I just configure it like this?

"entities": {
    "A1": {
      "num_partitions": 1,
      "featurized": false
    },
    "A2": {
      "num_partitions": 2,
      "featurized": false
    },
    "B": {
      "num_partitions": 3,
      "featurized": false
    }
},
"relations": [
    {
      "name": "A1_A1",
      "lhs": "A1",
      "rhs": "A1",
      "weight": 1.0,
      "operator": "complex_diagonal",
      "all_negs": false
    },
    {
      "name": "A1_A2",
      "lhs": "A1",
      "rhs": "A2",
      "weight": 1.0,
      "operator": "complex_diagonal",
      "all_negs": false
    },
    {
      "name": "A2_A1",
      "lhs": "A2",
      "rhs": "A1",
      "weight": 1.0,
      "operator": "complex_diagonal",
      "all_negs": false
    },
    {
      "name": "A1_A1",
      "lhs": "A2",
      "rhs": "A2",
      "weight": 1.0,
      "operator": "complex_diagonal",
      "all_negs": false
    },
    {
      "name": "A1_B",
      "lhs": "A1",
      "rhs": "B",
      "weight": 1.0,
      "operator": "complex_diagonal",
      "all_negs": false
    },
    {
      "name": "A2_B",
      "lhs": "A2",
      "rhs": "B",
      "weight": 1.0,
      "operator": "complex_diagonal",
      "all_negs": false
    },
    {
      "name": "B_A1",
      "lhs": "B",
      "rhs": "A1",
      "weight": 1.0,
      "operator": "complex_diagonal",
      "all_negs": false
    },
    {
      "name": "B_A2",
      "lhs": "B",
      "rhs": "A2",
      "weight": 1.0,
      "operator": "complex_diagonal",
      "all_negs": false
    },
    {
      "name": "B_B",
      "lhs": "B",
      "rhs": "B",
      "weight": 1.0,
      "operator": "complex_diagonal",
      "all_negs": false
    },
  ],

I think relations like A1_B and A2_B should be the same relation, but I made them separate; I'd like to know whether this has any effect on the embedding result.

[ERROR] Parse edge_paths failed

Environment

python 3.7.1
anaconda, pytorch 1.0.1
torchbiggraph installed via

pip install torchbiggraph

Steps to reproduce

torchbiggraph_train \
  examples/configs/fb15k_config.py \
  -p edge_paths=data/FB15k/freebase_mtr100_mte100-train_partitioned

Observed Results

Error in the configuration file, aborting.
edge_paths: Not a list

Expected Results

The training process runs successfully.

Getting nan as loss

Observed Results

[screenshot of training log showing NaN loss]

operator = 'translation'
loss_fn = 'ranking'
comparator = 'cos'

Expected Results

The loss should always be a number (not NaN).

Relevant Code

It does not happen deterministically, but I'll try to provide repro code.

fb15k.py hangs on Mac

Steps to reproduce

  1. pip install biggraph
  2. git clone the repo
  3. run fb15k.py by "python examples/fb15k.py"

Observed Results

It hangs on Mac:

2019-04-04 13:58:22 Loading entity counts...
2019-04-04 13:58:22 Creating workers...
2019-04-04 13:58:22 Initializing global model...
2019-04-04 13:58:22 Starting epoch 1 / 50 edge path 1 / 1 edge chunk 1 / 1
2019-04-04 13:58:22 edge_path= data/FB15k/freebase_mtr100_mte100-train_partitioned
still in queue: 0
2019-04-04 13:58:22 Swapping partitioned embeddings None ( 0 , 0 )
2019-04-04 13:58:22 Loading entities
7.67MB [00:29, 397kB/s]

Expected Results

Expected it to run through.


Python 3.7.1
macOS High Sierra Version 10.13.6

Any example showing how to do distributed training with data resident on HDFS?

Thanks for open-sourcing the great project!
Is there any example showing how to do distributed training, especially with data located on HDFS?
It seems PBG currently loads data from the local file system.
But for extremely large graphs (with trillions of edges), loading data from the local file system is impractical.
Is there any plan to add HDFS support?

Thanks.

Using pre-trained WikiData embeddings for nearest neighbor search

I downloaded the 36 GB gzipped file (wikidata_translation_v1.tsv.gz) containing the pre-trained WikiData embeddings listed on this page: https://torchbiggraph.readthedocs.io/en/latest/pretrained_embeddings.html
I unzipped it and I see that the actual tsv file is 100+ GB.

I would like to utilize these embeddings to perform nearest neighbor search. But to perform this search, I would need to use the same comparator and operator that was used to train these, right?

I see on the page that the comparator used was dot and the operator used was translation. Is the learned translation vector already added to the embeddings in the tsv file or do I need to fetch it from some place and manually add it to the embeddings before I perform the dot product?

Also, I see that dynamic_relations is set to True. I haven't read the section on Dynamic Relations in the docs in detail yet, but it looks like the operator is applied to right-hand side entities under "normal" circumstances and dynamic_relations=True is not a normal circumstance? If so, in this circumstance, which side should the translation operator be applied to: left, right or both?

Also, there is a small example on Nearest Neighbor Search using the FAISS library in the Downstream Tasks section in the docs, although the example does not use the WikiData embeddings. Given how large the WikiData embeddings file is, it would be great if you could share a variant of the same FAISS example that uses the WikiData embeddings!

Are the relations between entities not the difference of the entity vectors?

Using some links from Lerks -- #13 -- I tried searching for the closest relations between entities by taking the difference between the entity vectors and doing a search with faiss. However, the results seem wrong; how else should I approach this problem?

import json
import numpy as np
import pickle
import faiss


with open('wikidata_translation_v1_names.json') as fd:
    file = json.load(fd)

index_to_relation = {}
for i in range(len(file[78404883:])):
    index_to_relation[i] = file[i+78404883]

id_to_index = {}
for i in range(len(file)):
    id_to_index[file[i]] = i

all_embeddings = np.load("wikidata_translation_v1_vectors.npy", mmap_mode = 'r')

class Constants:
    ENTITIES = all_embeddings[0:78404883]
    RELATIONS = all_embeddings[78404883:]

relations_index = faiss.IndexFlatL2(200)
relations_index.add(Constants.RELATIONS)

def closest_relation(key1, key2, k, index, id_to_index, index_to_relation):
    """
    key_id: a string representing an entitiy; ex: '<http://www.wikidata.org/entity/Q37079688>' 
    k: an int that indicates how many closest neighbors to search for
    id_to_index: dictionary that maps id to index
    index_to_id: dictionary that maps index to id
    """
    result = []

    vector1 = np.array([Constants.ENTITIES[id_to_index[key1]]])
    vector2 = np.array([Constants.ENTITIES[id_to_index[key2]]])
    
    vector = vector1 - vector2
    
    _, neighbors = index.search(vector, k)
    
    for i in range(k):
        temp = index_to_relation[neighbors[0,i]]
        result.append(temp)
    
    return result

key1 = '<http://www.wikidata.org/entity/Q76>'
key2 = '<http://www.wikidata.org/entity/Q207>'

print(closest_relation(key1,key2,5,relations_index,id_to_index,index_to_relation))

Here are the results:

['<http://www.wikidata.org/prop/direct/P1603>_reverse_relation', '<http://www.wikidata.org/prop/direct/P2892>_reverse_relation', '<http://www.wikidata.org/prop/direct/P1164>_reverse_relation', '<http://www.wikidata.org/prop/direct/P1060>_reverse_relation', '<http://www.wikidata.org/prop/direct/P2320>_reverse_relation']

Is there any reason why PBG can not be used to train word2vec embeddings?

Sorry for the newbie question, but I have looked up graph embeddings, this repo, and the PBG paper, and it looks like this can be used to train word2vec embeddings (whole other story of why I want to train word2vec on this). I just have to think about it as treating a target word's index as the main node, and the context word's index as the adjacent nodes.

Or, am I missing something big here?

[ERROR] Bus error (core dumped)

My environment: python 3.6.8 (anaconda), pytorch 1.0.1.post2, cuda 9.0
I followed the guide:
pip install torchbiggraph
examples/livejournal.py
Then I get the error "Bus error (core dumped)".
Can you tell me how to fix this?

//my terminal's output
Using downloaded and verified file: data/soc-LiveJournal1.txt.gz
Extracting data/soc-LiveJournal1.txt.gz
Downloaded and extracted file.
Shuffling and splitting train/test file. This may take a while.
Reading data from file: data/soc-LiveJournal1.txt
Shuffling data
Splitting to train and test files
Found some files that indicate that the input data has already been preprocessed, not doing it again.
These files are: data/livejournal/dictionary.json, data/livejournal/entity_count_user_id_0.txt, data/train_partitioned/edges_0_0.h5, data/test_partitioned/edges_0_0.h5
2019-04-16 11:58:14 Loading entity counts...
2019-04-16 11:58:14 Creating workers...
2019-04-16 11:58:14 Initializing global model...
Bus error (core dumped)

Idea: Double sided operators and sequence of operators

Have you explored the idea of providing the ability to specify more than one operation for each relation?

Something like that:

'operators': [{
  'type': 'linear',
  'side': 'both', # Apply this operator on both lhs and rhs
}, {
  'type': 'translation',
  'side': 'left',  # Apply this operator on lhs only
}]

To give some context, my final goal is to model TransD. In TransD the embeddings of the entities need to be projected into another embedding space before having the translation operator applied.



I've forked the repo and made part of the changes needed to achieve this, but I'd like to discuss whether there's interest in having something like that in BigGraph.

Wikidata Embeddings have the description as the key instead of the QID

I downloaded the pretrained wikidata embeddings, and I see lines like:

"scientific article published on January 1986"@en-gb	-0.3222	-0.6460	-0.1054	0.2531 ...

This seems to be the description of the wikidata entity. I think the QID, e.g. Q2 for earth, should be used as the unique identifier as even the entity title is not necessarily unique.

How to use trained embeddings for relation prediction?

I am going through the BigGraph paper and found it interesting. I have trained embeddings on the FB15k dataset; now how can I use those trained embeddings to predict relations on an unknown dataset?

As I can see, you have shown results with R-GCN and other models on the FB15k dataset.
So after training, if I have two nodes and I want to get the edge between them, which contains the information (url, relations, etc.), how can I do that?

Which models did you use in the paper?

Thank you for awesome work :)

[Question] Featurized Entities

First of all, thank you for this amazing project.

After training different embeddings, I recently started to look at featurized entities. To be more precise, some of the entities in my graph represent text documents and I already trained the corresponding word embeddings. I do know that it’s an advanced feature which hasn’t fully stabilized yet, i.e. that I have to implement my own converter based on the format described in the documentation. However, I was wondering whether you could please share an example of a converter for featurized entities? I think it would be helpful to look at such an example.

Thanks!

How to speed up import_from_tsv when preparing the data

I have a lot of TSV data (~800G). I found that when I used import_from_tsv to process 40G of data on one machine, it took 4 hours.

So to process 800G of data, will the time be 20 * 4 = 80 hours?
I don't know whether my machine (240G mem) can process that much data, and it will take a lot of time.

If something goes wrong (for example, the process is killed), do I need to start again from the beginning?

[Question] User graph embedding

I am training a user identity graph containing user events and interactions. Nodes have two types: users and properties. Properties represent user cookies, IP addresses, pageviews, etc. I am not sure whether that would require featurized entities, as so far I am not able to converge on a small dataset of around 300K edges.

I guess there is something fundamentally missing in my setup and I would appreciate any troubleshooting guidance on how to converge.

Most of the graph I'm training on consists of isolated nodes with some properties. So they are not connected to the rest of the graph.

Here is the result of 60 epochs:
loss 0.0132791. violators_lhs: 0.00924837 , violators_rhs: 0.00964025
pos_rank: 482.742 , mrr: 0.0115865 , r1: 0.00390625 , r10: 0.0117188 , r50: 0.0664062 , auc: 0.765625

Thanks a lot

Using it with undirected graph

Hello,
I'd like to ask how this can be used for a simple undirected graph (without any relationship type).
For example, given a social network graph, where each edge always represents a friendship, would it be possible to produce embeddings?
Can I directly replace the relationship type parameter with "friend" for all edges?

Regards,
Aditya Malte

[IndexError] when using custom data

I am really excited to try your promising software.
I created a data file composed of (linkedin_profile ids, relations, information):

relations such as: skilled_in, interested_in, works_in...
profile information such as: skills, company, interests, ...

I used the same config file as the fb15k data (I have already checked the config file and it looks like nothing needs to change for my data).
So, I just ran this command:

!torchbiggraph_import_from_tsv \
  --lhs-col=0 --rel-col=1 --rhs-col=2 \
  PyTorch-BigGraph/torchbiggraph/examples/configs/fb15k_config.py \
  mydatafile.txt

But I got this error:

Looking up relation types in the edge files...
- Found 3 relation types
- Removing the ones with fewer than 1 occurrences...
- Left with 3 relation types
- Shuffling them...
Searching for the entities in the edge files...
Traceback (most recent call last):
  File "/usr/local/bin/torchbiggraph_import_from_tsv", line 10, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/dist-packages/torchbiggraph/converters/import_from_tsv.py", line 363, in main
    opt.relation_type_min_count,
  File "/usr/local/lib/python3.6/dist-packages/torchbiggraph/converters/import_from_tsv.py", line 275, in convert_input_data
    entity_min_count,
  File "/usr/local/lib/python3.6/dist-packages/torchbiggraph/converters/import_from_tsv.py", line 95, in collect_entities_by_type
    counters[relation_configs[rel_id].rhs][words[rhs_col]] += 1
IndexError: list index out of range

I ran this again with only the first 100 lines of my data file,
and it works this time! What happened ❓

Thank you!

[Question] about the pre-trained wikidata embeddings

Hi, thanks for open sourcing this useful tool! I wanted to know more details about the pre-trained wikidata embeddings.

  • What was the loss function used for training?
  • What was the negative sampling scheme and how many negative samples were used?

Thanks!

Forming sentence embedding

Sorry if this sounds silly or isn't the right place to ask, but I felt this community could answer it better than anywhere else. I am a little confused about the concept: graph embedding gives a numerical representation of a graph. Suppose I build a sentence as a graph whose nodes are words (with their embeddings taken from Word2Vec, FastText or GloVe), connected based on co-occurrence, with the distance between nodes given by their cosine similarity. Can I then consider the graph embedding to be the sentence embedding of that sentence? And with a similar approach, can I construct embeddings for paragraphs and documents? Would this be a right approach?

Running example results in lower evaluation scores than stated on GitHub readme

I followed the basic instructions in the GitHub Readme, however, the results I get are lower than described in the Readme:

The last few lines of the output when running the eval script:

...
2019-07-11 11:16:28  WARNING: Adding uniform negatives makes no sense when already using all negatives
2019-07-11 11:16:28  ( 0 , 0 ): Processed 59071 edges in 5.6 s (0.011M/sec); load time: 0.0015 s
2019-07-11 11:16:28  Stats for edge path 1 / 1, bucket ( 0 , 0 ): pos_rank:  209.651 , mrr:  0.245 , r1:  0.136133 , r10:  0.492543 , r50:  0.720726 , auc:  0.989885 , count:  59071
2019-07-11 11:16:28
2019-07-11 11:16:28  Stats for edge path 1 / 1: pos_rank:  209.651 , mrr:  0.245 , r1:  0.136133 , r10:  0.492543 , r50:  0.720726 , auc:  0.989885 , count:  59071
2019-07-11 11:16:28
2019-07-11 11:16:28
2019-07-11 11:16:28  Stats: pos_rank:  209.651 , mrr:  0.245 , r1:  0.136133 , r10:  0.492543 , r50:  0.720726 , auc:  0.989885 , count:  59071

The Readme states that the last line should be:
Stats: pos_rank: 65.4821 , mrr: 0.789921 , r1: 0.738501 , r10: 0.876894 , r50: 0.92647 , auc: 0.989868 , count: 59071

Are there any explanations?

I think this issue was already raised in #18 (but was misunderstood as pointing out a discrepancy with the paper)

Enforcing max_norm on entities and operators

My understanding is that currently, the max_norm option allows users to enforce the norm of the embeddings learned for the entities to be smaller than a specific value.

If that's correct, I have two questions:

  • In operators like complex_diagonal where the embedding learned is made of two components, how can one enforce max_norm on the two separate parts? (enforce max_norm on the first half and max_norm on the second half)

  • In case there's the need to enforce max_norm also on the parameters learned for the operator, how can one achieve that?

Cannot install torch

Steps to reproduce

  1. pip install torchbiggraph

Observed Results

Could not find a version that satisfies the requirement torch>=1.0 (from torchbiggraph) (from versions: 0.1.2, 0.1.2.post1)
No matching distribution found for torch>=1.0 (from torchbiggraph)



[Question] About Model Inference and Wikidata SPARQL

Thank you for this amazing project and pre-trained model.
Typically when dealing with Wikidata RDF, one uses SPARQL as the interface to the RDF triple model, so that it is possible to get items with their relations, etc.
That said, assuming one can load the Wikidata model you have released, it should be possible to infer an embedding for specified source and target nodes and their relations.
So the question is: given that this model has already learned the neighboring nodes, is it possible to get the nearest nodes directly (like in the nn API of Word2Vec, to be clear), or must one use an approximate nearest neighbor library such as FAISS or Annoy to find the closest embedding vectors?
Thank you.

error running torchbiggraph_live_journal

Steps to reproduce

  1. Run the shell script installed at /usr/bin/torchbiggraph_live_journal (note that it creates an uncompressed file that the script expects to be compressed).
  2. It fails in import_from_tsv.py in generate_edge_path_files.


Expected Results

I am trying to develop an intuition for entity_type vs. entity_name and how many partitions should be devoted to a given node type (the config file just gives an example of 1).
The point of running this is to shard a graph that can't be held in memory and train in parallel on a multi-core conventional CPU. Please advise.

I am confused about your terminology, I think.


Feature request: Initial embeddings

Having a feature to allow the user to define initial embeddings easily would be very helpful!
Currently the only easy workaround is modifying checkpoint files (and using init_path), but less convoluted methods are not yet available.

For streaming behaviour data (think of reddit and twitter), it's meaningful to allow embeddings to change a little but also not too much from the previous training cycle. Also new data needs to be accommodated.

It would be cool if train.py could accept custom initial embeddings and create embeddings for new entities in edge_list.

Thanks FAIR for open sourcing this awesome tool!

No module named 'filtered_eval'

Steps to reproduce

Executing the script:

https://github.com/facebookresearch/PyTorch-BigGraph/blob/master/examples/fb15k.py

shows the following error:


 $ python fb15k.py
Traceback (most recent call last):
  File "pg.py", line 13, in <module>
    from filtered_eval import FilteredRankingEvaluator
ModuleNotFoundError: No module named 'filtered_eval'  
pip install torchbiggraph
Requirement already satisfied: torchbiggraph in ./.local/lib/python3.6/site-packages (1.dev0)
Requirement already satisfied: tqdm in /opt/conda/lib/python3.6/site-packages (from torchbiggraph) (4.31.1)
Requirement already satisfied: torch>=1.0 in ./.local/lib/python3.6/site-packages (from torchbiggraph) (1.0.1.post2)
Requirement already satisfied: h5py in /opt/conda/lib/python3.6/site-packages (from torchbiggraph) (2.9.0)
Requirement already satisfied: attrs>=18.2.0 in /opt/conda/lib/python3.6/site-packages (from torchbiggraph) (19.1.0)
Requirement already satisfied: numpy in /opt/conda/lib/python3.6/site-packages (from torchbiggraph) (1.16.2)
Requirement already satisfied: six in /opt/conda/lib/python3.6/site-packages (from h5py->torchbiggraph) (1.12.0)

Save the embeddings after every nth epoch?

Hello,

How can I save the embeddings after every nth epoch? Suppose I am training the model for 1000 epochs but at the same time I want to save the model/embeddings every 100 epochs.

How can I do that?

Unable to reproduce fb15k results from downstream tasks

I have run the command
torchbiggraph_example_fb15k
in order to try and reproduce "Predicting the score of an edge" and "Ranking" result from this link https://torchbiggraph.readthedocs.io/en/latest/downstream_tasks.html#using-the-embeddings
However, for the edge score I get this result:
tensor([[21.2358]], grad_fn=<SumBackward2>)
I expected a number between 0 and 1 and close to 1 (because this statement is true).
The relative scores do make sense, because for Berlin being the capital of France I get 14.6, and for Barack Obama being the capital of France I get a score of 3.25.
I assume some sort of normalization function needs to be applied in order to normalize these scores?

For the Ranking example, I get 5 results:
['/m/0f8l9c', '/m/05qtj', '/m/01c6rd', '/m/09b83', '/m/0p0mx']
The first one is France, and the second one is Paris -- I didn't expect to see France here since we are determining the capital of France.

Not sure if I'm missing something here, I used the scripts from the link without changing them.

Thanks!

[Question] Does PBG support weighted edges?

It doesn't seem clear whether the algorithm supports weighted edges. If it does, should the edge simply be repeated in the input file multiple times?
I apologize if this is somewhere in the documentation and I missed it.

[Question] Batch the training

Hi, I'm new to PBG, so I don't know about all the features and functionality.
I have just processed my edges file with this command:

torchbiggraph_import_from_tsv \
  --lhs-col=0 --rel-col=1 --rhs-col=2 \
  PyTorch-BigGraph/torchbiggraph/examples/configs/fb15k_config.py \
  edges.txt

Output:

Looking up relation types in the edge files...

  • Found 727 relation types
  • Removing the ones with fewer than 1 occurrences...
  • Left with 727 relation types
  • Shuffling them...
    Searching for the entities in the edge files...
    Entity type all:
  • Found 3664382 entities
  • Removing the ones with fewer than 1 occurrences...
  • Left with 3664382 entities
  • Shuffling them...
    Preparing entity path data/FB15k:
  • Writing count of entity type all and partition 0
  • Writing count of dynamic relations
    ...

Then I start the training:
torchbiggraph_train \
  PyTorch-BigGraph/torchbiggraph/examples/configs/fb15k_config.py \
  -p edge_paths=edges_partitioned

Output :

2019-06-11 09:55:28 Loading entity counts...
2019-06-11 09:55:28 Creating workers...
2019-06-11 09:55:28 Initializing global model...
2019-06-11 09:55:32 Starting epoch 1 / 50 edge path 1 / 1 edge chunk 1 / 1
2019-06-11 09:55:32 edge_path= edges_partitioned
still in queue: 0
2019-06-11 09:55:32 Swapping partitioned embeddings None ( 0 , 0 )
2019-06-11 09:55:32 Loading entities

So the training exited when the process started loading entities.
I'm pretty sure it's a memory problem. Is it possible to batch the training and not load all the entities at once?

Thank you!

[Question] about the reported entity number on wikidata

Hello, thanks for your excellent work on large-scale pretrained graph embeddings. But I have a small question here:

The reported number of entities on Wikidata is 78 million. However, the number on the Wikidata website is only 56 million and I only extracted 54 million entities using the April 2019 dump. Can you give some details about how you extract entities from the dump?

Thanks!

[Question] configuration for reproducing TransE on FB15K

I have already reproduced the result of ComplEx on FB15k, following the example config. (Thank you for your work on this great framework; it's really good to use and saves a lot of effort.)

But when I try to implement the TransE model as the documentation suggests, changing operator to translation, comparator to cos and loss_fn to ranking, the result (run via examples/fb15k.py) is much lower than expected.

Playing with the eval_fraction option, it seems to me that 50 epochs are not enough for the model to converge. How many epochs are enough?
Or maybe my margins param should not be left at the default? Then what should I set margins to?
My configuration details are listed below.

Thank you!

Observed Results

This is the filtered result under examples/fb15k.py:

Stats: pos_rank:  97.5809 , mrr:  0.245797 , r1:  0.0258841 , r10:  0.651868 , r50:  0.858958 , auc:  0.983393 , count:  59071

This is the unfiltered result:

Stats for edge path 1 / 1: pos_rank:  249.24 , mrr:  0.152477 , r1:  0.0205685 , r10:  0.438252 , r50:  0.701723 , auc:  0.982343 , count:  59071

By the way, the final epoch (50th) loss is 12.0212.

Expected Results

In the paper (Table 2), the TransE result is MRR = 0.594, Hits@10 = 0.785.

My configuration

My config file:

def get_torchbiggraph_config():

    config = dict(
        entity_path=entity_base,

        num_epochs=50,

        entities={
            'all': {'num_partitions': 1},
        },

        relations=[{
            'name': 'all_edges',
            'lhs': 'all',
            'rhs': 'all',
            'operator': 'translation',
        }],
        dynamic_relations=True,

        edge_paths=[],

        checkpoint_path='model/fb15k_noeval',

        dimension=400,
        global_emb=False,
        comparator='cos',
        loss_fn='ranking',
        lr=0.1,
        num_uniform_negs=1000,

        eval_fraction=0,  # to reproduce results, we need to use all training data
    )

    return config

My conda environment:

# packages in environment at /home/work/anaconda2/envs/biggraph:
#
# Name                    Version                   Build  Channel
attrs                     19.1.0                   py36_1
blas                      1.0                         mkl
ca-certificates           2019.1.23                     0
certifi                   2019.3.9                 py36_0
cffi                      1.12.2           py36h2e261b9_1
cudatoolkit               9.0                  h13b8566_0
cudnn                     7.3.1                 cuda9.0_0
h5py                      2.9.0            py36h7918eee_0
hdf5                      1.10.4               hb1b8bf9_0
intel-openmp              2019.3                      199
libedit                   3.1.20181209         hc058e9b_0
libffi                    3.2.1                hd88cf55_4
libgcc-ng                 8.2.0                hdf63c60_1
libgfortran-ng            7.3.0                hdf63c60_0
libstdcxx-ng              8.2.0                hdf63c60_1
mkl                       2019.3                      199
mkl_fft                   1.0.10           py36ha843d7b_0
mkl_random                1.0.2            py36hd81dba3_0
ncurses                   6.1                  he6710b0_1
ninja                     1.9.0            py36hfd86e86_0
numpy                     1.16.2           py36h7e9f1db_0
numpy-base                1.16.2           py36hde5b4d6_0
openssl                   1.1.1b               h7b6447c_1
pip                       19.0.3                   py36_0
pycparser                 2.19                     py36_0
python                    3.6.8                h0371630_0
pytorch                   1.0.1           cuda90py36h8b0c50b_0
readline                  7.0                  h7b6447c_5
setuptools                40.8.0                   py36_0
six                       1.12.0                   py36_0
sqlite                    3.27.2               h7b6447c_0
tk                        8.6.8                hbc83047_0
tqdm                      4.31.1                   py36_1
wheel                     0.33.1                   py36_0
xz                        5.2.4                h14c3975_4
zlib                      1.2.11               h7b6447c_3

Graph partitioning

Hello!
Thank you for this amazing project.
I'd like to ask whether we can consider the embeddings of each partition as an embedding of a sub-graph.

Regards,

AttributeError: 'EntityList' object has no attribute 'to_tensor'

Steps to reproduce

  1. pip install torchbiggraph
  2. python examples/fb15k.py
  3. show running

Observed Results

2019-04-30 13:30:43 Exiting
2019-04-30 13:30:43 Building links map from path data/FB15k/freebase_mtr100_mte100-test_partitioned
Traceback (most recent call last):
  File "examples/fb15k.py", line 91, in <module>
    main()
  File "examples/fb15k.py", line 85, in main
    do_eval(eval_config, FilteredRankingEvaluator(eval_config, filter_paths))
  File "/root/graph/PyTorch-BigGraph/examples/filtered_eval.py", line 56, in __init__
    cur_lhs = lhs.to_tensor()[i].item()
AttributeError: 'EntityList' object has no attribute 'to_tensor'

Expected Results

  • It should finish successfully.


Is the result of ComplEx model on FB15K reproducible?

Hi, I made an attempt to reproduce the result on FB15K mentioned in the README. However, the result on the test set was much lower than what is mentioned in the README. The FB15K dataset I used was downloaded from the link in the README, and the command and config file (examples/configs/fb15k_config.py) were just the same as in the repo. The training was done on a single machine. Could you please look into this issue?

Configs for other datasets in the paper

Hi, can I find the configs for the other datasets, like YouTube and Twitter, mentioned in the paper? Information like the learning rate for the best result would be quite useful.

Does PBG support multiple relations between the same two entity types?

PBG is a very convenient tool for graph embedding; thanks for your work.

I have a question about relations between the same two entity types.

For example:
Two entity types, one representing users, the other representing items.
The relations between users and items can be "click", "pay", ...

So, I wonder whether PBG supports multiple relations between the same two entity types.
