Relational Data Embeddings for Feature Enrichment with Background Information

1. Structure of the repo

This repository contains code and data for the paper "Relational Data Embeddings for Feature Enrichment with Background Information".

1) The folder KEN contains the implementation of our approach, KEN, as described in the paper. It includes:

  • KEN/models/entity_embedding: classes for the TransE, DistMult and MuRE knowledge-graph embedding models (based on the PyKEEN package).

  • KEN/models/numerical_embedding: classes to embed numerical values, implementing both our approach (a linear layer with ReLU activation, see linear2.py and the sketch after this list) and a binning approach (binning.py).

  • KEN/sampling/pseudo_type.py: an adaptation of PyKEEN's PseudoTypedNegativeSampler, which replaces head entities with a random entity occurring in the same relation (sketched after this list).

  • KEN/training/hpp_trainer.py: a class to train embedding models with or without KEN, possibly over multiple hyperparameter configurations. It also measures the time and memory needed for training and saves the results in a .parquet file.

  • KEN/baselines/dfs.py: a class to perform Deep Feature Synthesis using the implementation from featuretools. It also measures the time/memory needed and the number of generated features.

  • KEN/evualation/prediction_scores.py: a set of functions to compute the cross-validation scores of embeddings / deep features on a target dataset.

  • KEN/dataloader/dataloader.py: a class to load triples in the .npy format and convert them into a TriplesFactory object that can be used by PyKEEN.

  • KEN/dataloader/make_triples.py: a function that takes tables/knowledge graphs as input and turns them into a set of triples saved in the .npy format (illustrated after this list).
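
To give an idea of the numerical-embedding approach (a single linear layer followed by a ReLU), here is a minimal PyTorch sketch. The class name and dimensions are illustrative and do not reflect the actual interface of linear2.py:

```python
import torch
import torch.nn as nn


class LinearNumericalEmbedding(nn.Module):
    """Illustrative sketch: map a scalar value to an embedding vector
    with a single linear layer followed by a ReLU activation."""

    def __init__(self, embedding_dim: int):
        super().__init__()
        self.linear = nn.Linear(1, embedding_dim)
        self.activation = nn.ReLU()

    def forward(self, values: torch.Tensor) -> torch.Tensor:
        # values: shape (batch_size,) of raw numerical attribute values
        return self.activation(self.linear(values.unsqueeze(-1)))


# Embed three numerical values into a 32-dimensional space
embedder = LinearNumericalEmbedding(embedding_dim=32)
vectors = embedder(torch.tensor([1.5, -0.3, 42.0]))
print(vectors.shape)  # torch.Size([3, 32])
```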

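The negative-sampling strategy of pseudo_type.py can be pictured as follows: for each training triple, the head is replaced by a random entity that appears as a head of the same relation elsewhere in the data. The NumPy function below is a simplified stand-in, not the PyKEEN-based implementation:

```python
import numpy as np


def corrupt_heads_pseudo_typed(triples: np.ndarray, rng=None) -> np.ndarray:
    """Simplified sketch: replace each head with a random entity that
    occurs as a head of the same relation in the training triples.

    triples: integer array of shape (n, 3) with columns (head, relation, tail).
    """
    rng = np.random.default_rng() if rng is None else rng
    # Candidate heads per relation, built once from the training triples
    heads_per_relation = {
        r: triples[triples[:, 1] == r, 0] for r in np.unique(triples[:, 1])
    }
    negatives = triples.copy()
    for i, (_, relation, _) in enumerate(triples):
        negatives[i, 0] = rng.choice(heads_per_relation[relation])
    return negatives


triples = np.array([[0, 0, 1], [2, 0, 3], [4, 1, 5]])
print(corrupt_heads_pseudo_typed(triples))
```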

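Similarly, turning a table into triples can be pictured as linking each row's entity, through the column names, to the other values in that row. The function below is a hypothetical, simplified version of what make_triples.py does, not the actual implementation:

```python
import numpy as np
import pandas as pd


def table_to_triples(df: pd.DataFrame, entity_column: str) -> np.ndarray:
    """Hypothetical sketch: turn a table into (head, relation, tail) triples,
    linking each row's entity to the other values in that row."""
    triples = []
    for _, row in df.iterrows():
        head = str(row[entity_column])
        for column in df.columns:
            if column == entity_column or pd.isna(row[column]):
                continue
            triples.append((head, column, str(row[column])))
    return np.array(triples)


counties = pd.DataFrame(
    {"county": ["Alameda", "Kings"], "state": ["CA", "NY"], "population": [1_600_000, 150_000]}
)
triples = table_to_triples(counties, entity_column="county")
np.save("triples.npy", triples)  # triples can then be stored in the .npy format
```
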
2) The folder experiments contains the datasets and the code to run our experiments.

  • experiments/model_training: code to train embedding models (TransE, DistMult, MuRE, RDF2Vec), save them as checkpoints during training, and store metadata about the checkpoints (parameters, time/memory complexity) in a .parquet file (a minimal training sketch follows this list).

  • experiments/deep_feature_synthesis: code to perform Deep Feature Synthesis, save the generated features, and store metadata (time/memory complexity, number of features) in a .parquet file.

  • experiments/manual_feature_engineering: code to manually build features and store them in .parquet files.

  • experiments/prediction_scores: code to compute cross-validation scores for all methods under study and store the results (scores, time complexity) in .parquet files (a generic scoring sketch follows this list).

  • experiments/attribute_reconstruction: code to compute cross-validation scores when reconstructing entities' numerical attributes (e.g. county population) from their embeddings. The results are stored in a .parquet file.

  • experiments/embedding_visualization: code to visualize in 2D the MuRE and MuRE + KEN embeddings trained on YAGO3 (a generic projection sketch follows this list).

  • experiments/results_visualization: a set of functions to visualize the results of the experiments.
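
For orientation, training one of the above embedding models with plain PyKEEN (without KEN) looks roughly like the sketch below; the triples file, hyperparameters and output directory are placeholders rather than the settings used in the experiments:

```python
import numpy as np
from pykeen.pipeline import pipeline
from pykeen.triples import TriplesFactory

# Placeholder file of (head, relation, tail) string triples
triples = np.load("triples.npy")
tf = TriplesFactory.from_labeled_triples(triples)
training, testing = tf.split([0.9, 0.1])

result = pipeline(
    training=training,
    testing=testing,
    model="MuRE",  # or "TransE", "DistMult"
    model_kwargs=dict(embedding_dim=64),
    training_kwargs=dict(num_epochs=50),
)
result.save_to_directory("mure_checkpoint")
```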

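The evaluation step essentially feeds entity embeddings (or deep features) to a supervised learner and cross-validates against the target. A generic scikit-learn sketch, where the embedding file and target column names are placeholders:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Placeholder inputs: one embedding row per target entity, aligned with target.parquet
X = np.load("entity_embeddings.npy")              # shape (n_entities, embedding_dim)
y = pd.read_parquet("target.parquet")["target"]   # target column name is a placeholder

model = HistGradientBoostingRegressor()
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(scores.mean(), scores.std())
```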

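For the 2D visualization, the sketch below projects embeddings with PCA from scikit-learn; the repository's code may rely on a different projection method:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

# Placeholder file of trained entity embeddings
embeddings = np.load("mure_embeddings.npy")   # shape (n_entities, embedding_dim)
coords = PCA(n_components=2).fit_transform(embeddings)

plt.scatter(coords[:, 0], coords[:, 1], s=5)
plt.title("2D PCA projection of entity embeddings")
plt.show()
```
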
3) The datasets used in our experiments are available here in the form of a zip file.

The unzipped datasets folder should be placed in experiments. For each dataset xxx, experiments/datasets/xxx contains:

  • a file target.parquet that contains the entities of interest (e.g. counties, cities) and the target to predict.
  • a folder triplets that contains the training triples in .npy format and their metadata.
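
Assuming this layout, a dataset can be loaded with pandas and NumPy as sketched below; the dataset name and the file name inside triplets are placeholders:

```python
import numpy as np
import pandas as pd

dataset_dir = "experiments/datasets/xxx"  # placeholder dataset name

# Entities of interest (e.g. counties, cities) and the target to predict
target = pd.read_parquet(f"{dataset_dir}/target.parquet")

# Training triples in .npy format (the exact file name inside triplets/ may differ)
triples = np.load(f"{dataset_dir}/triplets/triples.npy")
print(target.shape, triples.shape)
```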

2. How to run experiments

  1. Install KEN using the setup.py file (e.g. pip install . from the repository root).
  2. Run the experiments in order: model_training, deep_feature_synthesis, then prediction_scores and attribute_reconstruction.
  3. To avoid re-running the experiments, we provide the result files used in the paper. You can visualize them with functions from experiments/results_visualization/results_visualization.py.


ken's Issues

How are the tabular datasets preprocessed?

First of all, great paper! I especially found the idea of converting tables to a graph interesting.

I was curious how the tabular data (in particular KDD14 and KDD15) are preprocessed. I noticed in the paper that you are "using target entities as head entities and linking them to other entries from the same rows". Does that mean you were only using the single table containing the entities you want to predict, ignoring the other tables?

Thanks!
