EL-long-tail-phenomena

This repository contains the entire code for the analysis reported in the following COLING 2018 paper:

  @inproceedings{ilievski2018systematic,
        title = {Systematic Study of Long Tail Phenomena in Entity Linking},
        author={Ilievski, Filip and Vossen, Piek and Schlobach, Stefan},
        booktitle={proceedings of COLING},
        year = {2018}}

This paper was presented at COLING 2018. Slides can be found here.

1. Description of the data structure

1.1. Notebooks

There are four main notebooks that carry out the reported analysis:

  • Load.ipynb loads the evaluation datasets, runs the entity linking systems on them, and stores the results in the bin/ folder.
  • Data Analysis.ipynb performs analysis of the datasets' properties.
  • System Analysis (Micro).ipynb performs analysis of the system performance measured with micro F1-score in relation to the data properties.
  • System Analysis (Macro).ipynb performs additional analysis of system performance, but now with macro F1-score as a metric.
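The difference between the two metrics can be illustrated with a small sketch. This is not the code from the notebooks: the function names, the input format (one gold/system entity pair per mention), and the choice to macro-average over every entity that occurs as a gold or system label are assumptions made for illustration only.

```python
from collections import defaultdict


def f1(tp, fp, fn):
    """F1-score from raw counts; returns 0.0 when it is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def micro_macro_f1(pairs):
    """Compute (micro F1, macro F1) from (gold, system) entity pairs.

    `pairs` holds one tuple per mention; `system` may be None when the
    system produced no link. Hypothetical input format, for illustration.
    """
    counts = defaultdict(lambda: [0, 0, 0])  # entity -> [tp, fp, fn]
    for gold, system in pairs:
        if system == gold:
            counts[gold][0] += 1
        else:
            counts[gold][2] += 1          # the gold entity was missed
            if system is not None:
                counts[system][1] += 1    # a wrong entity was predicted

    # Micro: pool the counts of all entities, then compute one score.
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro = f1(tp, fp, fn)

    # Macro: compute one score per entity, then average them.
    per_entity = [f1(*c) for c in counts.values()]
    macro = sum(per_entity) / len(per_entity) if per_entity else 0.0
    return micro, macro


# A frequent entity "A" linked correctly three times, a rare entity "B" missed.
micro, macro = micro_macro_f1([("A", "A"), ("A", "A"), ("A", "A"), ("B", "A")])
```

In this toy case the micro score is dominated by the frequent entity, while the macro score weighs the missed rare entity equally, which is why the two notebooks can lead to different observations.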

1.2. Python files

There are several Python files with utility functions that are used in these notebooks:

  • classes.py contains the definitions of the main classes we deal with in this project. Above all, these are EntityMention and NewsItem.
  • dataparser.py contains functions that load data in different formats and produce instances of these classes.
  • load_utils.py contains additional functions that are helpful with the data loading process.
  • analysis_utils.py contains functions for manipulation and conversion of the data, for analysis, and for evaluation.
  • plot_utils.py has several plotting functions.

1.3. Directories

  • bin/ contains the news item objects as loaded from raw data, and the versions of them processed by each of the systems.
  • data/ contains the input datasets in their original format.
  • debug/ contains additional debugging logs or files.
  • img/ stores the plot images (in PNG format) created in this project as an output.

2. Before running the project

2.1. Dependencies

We use the usual Python modules for statistical analysis, data manipulation, and plotting: scipy, numpy, collections, matplotlib, seaborn, pickle, pandas, rdflib, and lxml.

We also use the Redis database and its Python client to cache some data.

This project has been coded and tested on a computer with Python v3.6.

Hence, please make sure that:

  • You can run Python 3.6 or a similar version.
  • You have installed the modules specified above.
  • You have installed Redis and its Python client on your machine.
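The checklist above can be verified with a short script along the following lines. This is only a sketch: it assumes that the Redis Python client is the `redis` package and that the Redis server listens on localhost at the default port 6379. Neither is stated in this README, and the function names are hypothetical.

```python
import importlib.util

# Modules listed in this README, plus the assumed Redis client package name.
REQUIRED = ["scipy", "numpy", "collections", "matplotlib", "seaborn",
            "pickle", "pandas", "rdflib", "lxml", "redis"]


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def redis_is_reachable(host="localhost", port=6379):
    """Return True if a Redis server answers a PING, False otherwise.

    Host and port are assumed defaults; adjust them to your setup.
    """
    try:
        import redis  # assumed client package (redis-py)
        client = redis.Redis(host=host, port=port, socket_connect_timeout=1)
        return bool(client.ping())
    except Exception:
        # Covers a missing client as well as an unreachable server.
        return False


if __name__ == "__main__":
    print("Missing modules:", missing_modules(REQUIRED) or "none")
    print("Redis reachable:", redis_is_reachable())
```

An empty list of missing modules together with a reachable Redis server suggests that the environment is ready for the steps below.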

2.2. Preparing complementary DBpedia data: PageRank values, redirects, and disambiguation values

We pre-load three types of DBpedia data in a Redis database for quick access during the analysis. These are: PageRank values, redirects, and disambiguation values.

For replication purposes, simply run the bash script prepare_redis_data.sh, which downloads these files, places them in the correct location locally, and caches them in Redis. Note that this script takes some time to execute (around 1.5-2 hours on my laptop).

2.3. Download evaluation data: N3 corpus and AIDA

Option 1: Obtain the data and prepare for processing

To run the entire project, including the retrieval of the system output by running the tools, users need to first obtain the datasets.

The N3 collection of datasets is publicly available on GitHub. For your convenience, we prepared a script that downloads this collection and creates the directory assumed by the analysis notebooks. You can run this script without arguments, as follows:

sh prepare_n3_data.sh

The second data collection, AIDA, is unfortunately not publicly available. It needs to be obtained from the LDC catalog.

Option 2 (preferred): Skip running the tools and use pre-cached .bin files

To ease replication and work around the proprietary AIDA dataset, we provide the .bin files that the Load.ipynb notebook creates in the first step of this project. This lets users sidestep the non-public dataset and skip re-running the entity linking systems and storing their output on disk, while still being able to run the entire analysis that supports the paper's findings and inspect the underlying data. Note that these .bin files contain only the set of entities with their mentions, which preserves the commercial aspect of AIDA.
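For readers who want to inspect a pre-cached file outside the notebooks, a minimal sketch follows. It assumes that the .bin files are Python pickles (pickle is among the listed dependencies, but this README does not state the file format), and both the helper name and the file name in the usage comment are hypothetical.

```python
import pickle


def load_bin(path):
    """Load one pre-cached object from `path`, assuming it is a pickle."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Hypothetical usage; replace the path with an actual file from bin/:
# news_items = load_bin("bin/example.bin")
# print(type(news_items))
```

If the files are indeed pickles of the project's own classes, the script has to run from a location where classes.py can be imported, since pickle needs the class definitions to rebuild the objects. As with any pickle, only load files from a source you trust.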

3. Running the project

Section 4 of the paper is based on the notebook Data Analysis.ipynb. This notebook uses the dataset .bin objects, either downloaded from GitHub or pre-cached by the Load.ipynb notebook.

Section 5 is based mostly on the notebook System Analysis (Micro).ipynb, complemented by some findings from the notebook System Analysis (Macro).ipynb. These two notebooks rely on the .bin objects, either downloaded from GitHub or pre-cached by the Load.ipynb notebook, which contain the disambiguation output of all systems on the datasets.

Feel welcome to rerun the notebooks to validate and inspect their results. All analysis notebooks run quickly (within a minute). The Load.ipynb notebook takes longer to run (around 2 hours in total) and expects one non-public dataset, AIDA (as discussed in 2.3), so it is probably a good idea to use the provided .bin files instead.

The green markers inside the notebooks help the reader relate the analysis in the notebook to the plots in the paper. For example, in Data analysis part 6) PageRank distribution of instances, we have a pointer "Section 4.2 of the paper".

4. Contact

For any questions or comments, please contact Filip via e-mail: [email protected].
