A state-of-the-art system for zero-shot fine-grained entity typing with minimal supervision
This is a demo system for our paper "Zero-Shot Open Entity Typing as Type-Compatible Grounding", which at the time of publication represented the state of the art in zero-shot entity typing.
The original experiments that produced all the results in the paper were run with a package written in Java. This is a rewritten package that contains the same core logic, without the experimental code; it exists solely to demo the algorithm and validate key results.
The results may differ slightly from the published numbers, due to randomness in the iteration order of Java's HashSet and Python's set. The difference should be within 0.5%.
A major limitation of this system is the speed of processing new sentences, which is dominated by ELMo inference. We have cached the ELMo results for the provided experiments to make them feasible to run.
To address this, we are working on an online demo, which we plan to release before EMNLP 2018.
- A minimum of 16 GB of available disk space and 16 GB of memory (lower specs will not work)
- Python 3.x (mostly tested on 3.5)
- A POSIX OS (Windows not tested)
- `virtualenv`, if you are installing with the script (check that the `virtualenv` command works)
- `wget`, if you are installing with the script (use brew to install it on OSX)
- `unzip`, if you are installing with the script
To make everyone's life easier, we provide a simple install script: just run `sh install.sh`. This script does everything mentioned in the next section, plus creating a virtualenv. Use `source venv/bin/activate` to activate it.
It is generally recommended to create a Python 3 virtualenv and work inside it.
First, install AllenAI's bilm-tf package by running `python3 setup.py install` in the `./bilm-tf` directory. Then install the requirements with `pip3 install -r requirements.txt` in the project root.
Then you need to download all the data/model files. There are two steps in this:
Then check that all files are present with `python3 scripts.py CHECKFILES`, or `python3 scripts.py CHECKFILES figer` to check the FIGER caches, etc.
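A CHECKFILES-style verification simply confirms that every required data/model file exists before anything is run. The sketch below illustrates the idea; the file names in `REQUIRED_FILES` are invented for illustration and are not the package's actual manifest.

```python
# Minimal sketch of a file-presence check (illustrative file names only,
# not the real manifest used by scripts.py).
from pathlib import Path

REQUIRED_FILES = [
    "data/esa/esa.db",         # assumed name, for illustration
    "data/models/elmo_cache",  # assumed name, for illustration
]

def check_files(root=".", required=REQUIRED_FILES):
    """Report missing files; return True only when all are present."""
    missing = [f for f in required if not (Path(root) / f).exists()]
    for f in missing:
        print("missing:", f)
    return not missing
```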
Currently you can do the following:
- Run the experiment on the FIGER test set (randomly sampled as in the paper):
python3 main.py figer
- Run the experiment on the BBN test set:
python3 main.py bbn
- Run the experiment on the first 1000 OntoNotes_fine test set instances (due to size issues):
python3 main.py ontonotes
Running on new sentences is generally expensive, but you can still do it. Please refer to `main.py` to see how to test on your own data.
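Conceptually, processing a new sentence composes the three stages described below (ESA candidate generation, ELMo ranking, and inference). This is a hypothetical wiring of that flow with the stages passed in as plain callables; the actual class names and signatures live in `zoe_utils.py` and `main.py`.

```python
# Hypothetical composition of the pipeline; not the real API.
def type_mention(sentence, esa_candidates, elmo_rank, infer):
    candidates = esa_candidates(sentence)      # top concepts from ESA
    ranked = elmo_rank(sentence, candidates)   # re-ranked by ELMo similarity
    return infer(sentence, ranked)             # final typing decision
```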
The package is composed of:
- A slightly modified ELMo source code, see `bilm-tf`
- A main library, `zoe_utils.py`
- An executor, `main.py`
- A script helper, `script.py`
This is the main library file and contains the core logic. It has four main component classes:
Supports all operations related to ESA and its data files. The main entry point is `EsaProcessor.get_candidates`, which, given a sentence, returns the top `EsaProcessor.RETURN_NUM` candidate Wikipedia concepts.
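To make the idea concrete, here is a toy illustration of ESA-style candidate generation: each word is associated with Wikipedia concepts by a weight, and a sentence's candidates are the concepts with the highest summed weight. The word-to-concept table below is invented for illustration; the real `EsaProcessor` is backed by precomputed Wikipedia statistics.

```python
# Toy ESA-style candidate generation (invented weights, not real ESA data).
RETURN_NUM = 2  # stands in for EsaProcessor.RETURN_NUM

WORD_TO_CONCEPTS = {
    "guitar": {"Guitar": 1.0, "Music": 0.6},
    "played": {"Music": 0.4, "Game": 0.3},
    "stage":  {"Theatre": 0.7, "Music": 0.2},
}

def get_candidates(sentence, top_n=RETURN_NUM):
    """Score concepts by summing per-word association weights."""
    scores = {}
    for word in sentence.lower().split():
        for concept, weight in WORD_TO_CONCEPTS.get(word, {}).items():
            scores[concept] = scores.get(concept, 0.0) + weight
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_n]
```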
Supports all operations related to ELMo and its data files. The main entry point is `ElmoProcessor.rank_candidates`, which, given a sentence and a list of candidates (generated by ESA), ranks them by the cosine similarity of their ELMo representations (see the paper). It returns the top `ElmoProcessor.RANKED_RETURN_NUM` candidates.
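The ranking step itself is just cosine similarity between the mention's vector and each candidate's vector. The sketch below shows that computation with toy vectors; in the real system the vectors are ELMo representations.

```python
# Cosine-similarity ranking of candidates (toy vectors, not ELMo output).
import math

RANKED_RETURN_NUM = 2  # stands in for ElmoProcessor.RANKED_RETURN_NUM

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_candidates(mention_vec, candidates, top_n=RANKED_RETURN_NUM):
    """candidates: {concept_name: vector}; return the top_n most similar."""
    ranked = sorted(candidates,
                    key=lambda c: cosine(mention_vec, candidates[c]),
                    reverse=True)
    return ranked[:top_n]
```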
This is the core engine that performs inference given the outputs of the previous processors. The logic is as described in the paper and is rather involved. The main entry point is `InferenceProcessor.inference`, which receives a sentence along with the outputs of the previously mentioned processors, and sets the inference results.
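A heavily simplified caricature of the grounding idea: once a mention is grounded to ranked Wikipedia concepts, the types known for those concepts vote for the mention's type. The concept-to-type mapping below is invented for illustration, and the real `InferenceProcessor` is far more involved (see the paper).

```python
# Toy type-compatible grounding: ranked concepts vote for a type,
# with better-ranked concepts weighted more. Mapping is invented.
CONCEPT_TYPES = {
    "Chicago_Bulls": "/organization/sports_team",
    "Chicago": "/location/city",
}

def inference(ranked_concepts):
    votes = {}
    for rank, concept in enumerate(ranked_concepts):
        concept_type = CONCEPT_TYPES.get(concept)
        if concept_type:
            votes[concept_type] = votes.get(concept_type, 0.0) + 1.0 / (rank + 1)
    return max(votes, key=votes.get) if votes else None
```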
This evaluates performance and prints it, given a list of sentences processed by `InferenceProcessor`.
Initialize this with a data file path. It reads the standard JSON format (see the examples) and transforms the data into a list of `Sentence` instances.
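A loader in this style might look like the sketch below. The field names (`tokens`, `mention_start`, `mention_end`) are assumptions for illustration; consult the provided example files for the actual JSON format.

```python
# Sketch of a DataReader-style loader (assumed field names, not the
# package's actual schema).
import json
from dataclasses import dataclass

@dataclass
class Sentence:
    tokens: list
    mention_start: int
    mention_end: int

def read_sentences(lines):
    """Parse one JSON record per line into Sentence objects."""
    sentences = []
    for line in lines:
        record = json.loads(line)
        sentences.append(Sentence(record["tokens"],
                                  record["mention_start"],
                                  record["mention_end"]))
    return sentences
```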
See the following paper:
@inproceedings{ZKTR18,
author = {Ben Zhou and Daniel Khashabi and Chen-Tse Tsai and Dan Roth},
title = {Zero-Shot Open Entity Typing as Type-Compatible Grounding},
booktitle = {EMNLP},
year = {2018},
}