Relational Data Embeddings for Feature Enrichment with Background Information

1. Structure of the repo

This repository contains code and data for the paper "Relational Data Embeddings for Feature Enrichment with Background Information".

1) The folder KEN contains the implementation of our approach, KEN, as described in the paper. It includes:

  • KEN/models/entity_embedding: classes for the TransE, DistMult and MuRE knowledge-graph embedding models (based on the PyKEEN package).

  • KEN/models/numerical_embedding: classes to embed numerical values, implementing both our approach (a linear layer with ReLU activation, see linear2.py and the sketch after this list) and a binning approach (binning.py).

  • KEN/sampling/pseudo_type.py: an adaptation of PyKEEN's PseudoTypedNegativeSampler, which replaces head entities with a random entity occurring in the same relation (sketched after this list).

  • KEN/training/hpp_trainer.py: a class to train embedding models with or without KEN, possibly over multiple hyperparameter configurations. It also measures the time and memory needed for training and saves the results in a .parquet file.

  • KEN/baselines/dfs.py: a class to perform Deep Feature Synthesis using the implementation from featuretools. It also measures the time/memory needed and the number of generated features.

  • KEN/evualation/prediction_scores.py: a set of functions to compute the cross-validation scores of embeddings / deep features on a target dataset.

  • KEN/dataloader/dataloader.py: a class to load triples in the .npy format and convert them into a TriplesFactory object that can be used by PyKEEN.

  • KEN/dataloader/make_triples.py: a function that takes tables/knowledge graphs as input and turns them into a set of triples saved in the .npy format (illustrated after this list).
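
To give an idea of the numerical-embedding approach (a single linear layer followed by a ReLU), here is a minimal PyTorch sketch. The class name and dimensions are illustrative and do not reflect the actual interface of linear2.py:

```python
import torch
import torch.nn as nn


class LinearNumericalEmbedding(nn.Module):
    """Illustrative sketch: map a scalar value to an embedding vector
    with a single linear layer followed by a ReLU activation."""

    def __init__(self, embedding_dim: int):
        super().__init__()
        self.linear = nn.Linear(1, embedding_dim)
        self.activation = nn.ReLU()

    def forward(self, values: torch.Tensor) -> torch.Tensor:
        # values: shape (batch_size,) of raw numerical attribute values
        return self.activation(self.linear(values.unsqueeze(-1)))


# Embed three numerical values into a 32-dimensional space
embedder = LinearNumericalEmbedding(embedding_dim=32)
vectors = embedder(torch.tensor([1.5, -0.3, 42.0]))
print(vectors.shape)  # torch.Size([3, 32])
```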

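The negative-sampling strategy of pseudo_type.py can be pictured as follows: for each training triple, the head is replaced by a random entity that appears as a head of the same relation elsewhere in the data. The NumPy function below is a simplified stand-in, not the PyKEEN-based implementation:

```python
import numpy as np


def corrupt_heads_pseudo_typed(triples: np.ndarray, rng=None) -> np.ndarray:
    """Simplified sketch: replace each head with a random entity that
    occurs as a head of the same relation in the training triples.

    triples: integer array of shape (n, 3) with columns (head, relation, tail).
    """
    rng = np.random.default_rng() if rng is None else rng
    # Candidate heads per relation, built once from the training triples
    heads_per_relation = {
        r: triples[triples[:, 1] == r, 0] for r in np.unique(triples[:, 1])
    }
    negatives = triples.copy()
    for i, (_, relation, _) in enumerate(triples):
        negatives[i, 0] = rng.choice(heads_per_relation[relation])
    return negatives


triples = np.array([[0, 0, 1], [2, 0, 3], [4, 1, 5]])
print(corrupt_heads_pseudo_typed(triples))
```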

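Similarly, turning a table into triples can be pictured as linking each row's entity, through the column names, to the other values in that row. The function below is a hypothetical, simplified version of what make_triples.py does, not the actual implementation:

```python
import numpy as np
import pandas as pd


def table_to_triples(df: pd.DataFrame, entity_column: str) -> np.ndarray:
    """Hypothetical sketch: turn a table into (head, relation, tail) triples,
    linking each row's entity to the other values in that row."""
    triples = []
    for _, row in df.iterrows():
        head = str(row[entity_column])
        for column in df.columns:
            if column == entity_column or pd.isna(row[column]):
                continue
            triples.append((head, column, str(row[column])))
    return np.array(triples)


counties = pd.DataFrame(
    {"county": ["Alameda", "Kings"], "state": ["CA", "NY"], "population": [1_600_000, 150_000]}
)
triples = table_to_triples(counties, entity_column="county")
np.save("triples.npy", triples)  # triples can then be stored in the .npy format
```
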
2) The folder experiments contains the datasets and the code to run our experiments.

  • experiments/model_training: code to train embedding models (TransE, DistMult, MuRE, RDF2Vec), save them as checkpoints during training, and store metadata about the checkpoints (parameters, time/memory complexity) in a .parquet file (a minimal training sketch follows this list).

  • experiments/deep_feature_synthesis: code to perform Deep Feature Synthesis, save the generated features, and store metadata (time/memory complexity, number of features) in a .parquet file.

  • experiments/manual_feature_engineering: code to manually build features and store them in .parquet files.

  • experiments/prediction_scores: code to compute cross-validation scores for all methods under study and store the results (scores, time complexity) in .parquet files (a generic scoring sketch follows this list).

  • experiments/attribute_reconstruction: code to compute cross-validation scores when reconstructing entities' numerical attributes (e.g. county population) from their embeddings. The results are stored in a .parquet file.

  • experiments/embedding_visualization: code to visualize in 2D the MuRE and MuRE + KEN embeddings trained on YAGO3 (a generic projection sketch follows this list).

  • experiments/results_visualization: a set of functions to visualize the results of the experiments.
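
For orientation, training one of the above embedding models with plain PyKEEN (without KEN) looks roughly like the sketch below; the triples file, hyperparameters and output directory are placeholders rather than the settings used in the experiments:

```python
import numpy as np
from pykeen.pipeline import pipeline
from pykeen.triples import TriplesFactory

# Placeholder file of (head, relation, tail) string triples
triples = np.load("triples.npy")
tf = TriplesFactory.from_labeled_triples(triples)
training, testing = tf.split([0.9, 0.1])

result = pipeline(
    training=training,
    testing=testing,
    model="MuRE",  # or "TransE", "DistMult"
    model_kwargs=dict(embedding_dim=64),
    training_kwargs=dict(num_epochs=50),
)
result.save_to_directory("mure_checkpoint")
```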

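The evaluation step essentially feeds entity embeddings (or deep features) to a supervised learner and cross-validates against the target. A generic scikit-learn sketch, where the embedding file and target column names are placeholders:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Placeholder inputs: one embedding row per target entity, aligned with target.parquet
X = np.load("entity_embeddings.npy")              # shape (n_entities, embedding_dim)
y = pd.read_parquet("target.parquet")["target"]   # target column name is a placeholder

model = HistGradientBoostingRegressor()
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(scores.mean(), scores.std())
```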

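For the 2D visualization, the sketch below projects embeddings with PCA from scikit-learn; the repository's code may rely on a different projection method:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

# Placeholder file of trained entity embeddings
embeddings = np.load("mure_embeddings.npy")   # shape (n_entities, embedding_dim)
coords = PCA(n_components=2).fit_transform(embeddings)

plt.scatter(coords[:, 0], coords[:, 1], s=5)
plt.title("2D PCA projection of entity embeddings")
plt.show()
```
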
3) The datasets used in our experiments are available here in the form of a zip file.

The unzipped datasets folder should be placed in experiments. For each dataset xxx, experiments/datasets/xxx contains:

  • a file target.parquet that contains the entities of interest (e.g. counties, cities) and the target to predict.
  • a folder triplets that contains the training triples in .npy format and their metadata.
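
Assuming this layout, a dataset can be loaded with pandas and NumPy as sketched below; the dataset name and the file name inside triplets are placeholders:

```python
import numpy as np
import pandas as pd

dataset_dir = "experiments/datasets/xxx"  # placeholder dataset name

# Entities of interest (e.g. counties, cities) and the target to predict
target = pd.read_parquet(f"{dataset_dir}/target.parquet")

# Training triples in .npy format (the exact file name inside triplets/ may differ)
triples = np.load(f"{dataset_dir}/triplets/triples.npy")
print(target.shape, triples.shape)
```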

2. How to run experiments

  1. Install KEN using the setup.py file (e.g. pip install . from the repository root).
  2. Run the experiments in order: model_training, deep_feature_synthesis, then prediction_scores and attribute_reconstruction.
  3. To avoid re-running the experiments, we provide the result files used in the paper. You can visualize them with functions from experiments/results_visualization/results_visualization.py.


ken's Issues

How are the tabular datasets preprocessed?

First of all, great paper! I especially found the idea of converting tables to a graph interesting.

I was curious how the tabular data (in particular KDD14 and KDD15) are preprocessed. I noticed in the paper that you are "using target entities as head entities and linking them to other entries from the same rows". Does that mean you were only using the single table containing the entities you want to predict, ignoring the other tables?

Thanks!
