epfl-dlab / genie


The autoregressive information extraction system GenIE (Generative Information Extraction) implemented in PyTorch.

License: MIT License

Python 82.28% Shell 1.84% Jupyter Notebook 15.88%

genie's Introduction

GenIE: Generative Information Extraction

This repository contains a PyTorch implementation of the autoregressive information extraction system GenIE, proposed in the paper GenIE: Generative Information Extraction. We extend these ideas in our follow-up work on SynthIE; visit this link for details.

@inproceedings{josifoski-etal-2022-genie,
    title = "{G}en{IE}: Generative Information Extraction",
    author = "Josifoski, Martin  and
      De Cao, Nicola  and
      Peyrard, Maxime  and
      Petroni, Fabio  and
      West, Robert",
    booktitle = "Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jul,
    year = "2022",
    address = "Seattle, United States",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.naacl-main.342",
    doi = "10.18653/v1/2022.naacl-main.342",
    pages = "4626--4643",
}

If you found the provided resources useful, please consider citing our work.


GenIE in a Nutshell

GenIE uses a sequence-to-sequence model that takes unstructured text as input and autoregressively generates a structured semantic representation of the information expressed in it, in the form of (subject, relation, object) triplets, as output. GenIE employs constrained beam search with: (i) a high-level, structural constraint which asserts that the output corresponds to a set of triplets; (ii) lower-level, validity constraints which use prefix tries to force the model to only generate valid entity or relation identifiers (from a predefined schema). Here is an illustration of the generation process for a given example:
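The lower-level validity constraints can be illustrated with a tiny prefix trie: at each decoding step, only token ids that extend some valid identifier are allowed. Below is a minimal, self-contained sketch of the idea; the class and method names are illustrative, not GenIE's actual API.

```python
class PrefixTrie:
    """Stores a set of valid token-id sequences (e.g. tokenized entity names)."""

    def __init__(self, sequences):
        self.root = {}
        for seq in sequences:
            node = self.root
            for tok in seq:
                node = node.setdefault(tok, {})

    def allowed_next(self, prefix):
        """Return the token ids that may legally follow the given prefix."""
        node = self.root
        for tok in prefix:
            if tok not in node:
                return []  # prefix is not part of any valid identifier
            node = node[tok]
        return sorted(node.keys())


# Two valid identifiers, represented as token-id sequences.
trie = PrefixTrie([[5, 8, 2], [5, 9, 1]])
print(trie.allowed_next([5]))  # only tokens 8 or 9 may follow token 5
```

During constrained beam search, the decoder's next-token distribution would be masked so that only the ids returned by `allowed_next` receive non-zero probability.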

Our experiments show that GenIE achieves state-of-the-art performance on the task of closed information extraction, generalizes from fewer training data points than baselines, and scales to a previously unmanageable number of entities and relations.
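The linearized triplet format (`<sub> ... <rel> ... <obj> ... <et>`, as quoted in the demo outputs further down this page) can be parsed back into structured triplets with a few lines of Python. This is a hedged sketch, not part of the GenIE codebase:

```python
import re

def parse_triplets(text):
    """Extract (subject, relation, object) tuples from a linearized output string."""
    pattern = r"<sub>\s*(.*?)\s*<rel>\s*(.*?)\s*<obj>\s*(.*?)\s*<et>"
    return re.findall(pattern, text)

out = " <sub> Arizona <rel> capital <obj> Phoenix, Arizona <et>"
print(parse_triplets(out))  # [('Arizona', 'capital', 'Phoenix, Arizona')]
```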

Dependencies

To install the dependencies needed to execute the code in this repository run:

bash setup.sh

Usage Instructions & Examples

The demo notebook provides a complete walkthrough of how to download and use GenIE's functionality, as well as the additional data resources.

Training & Evaluation

Training

Each of the provided models (see the demo) is associated with a Hydra configuration file that reproduces its training. For instance, to run the training for the genie_r model, run:

MODEL_NAME=genie_r
python run.py experiment=$MODEL_NAME

Evaluation

Hydra provides a clean interface for evaluation. You just need to specify the checkpoint that needs to be evaluated, the dataset to evaluate it on, and the constraints to be enforced (or not) during generation:

PATH_TO_CKPT=<path_to_the_checkpoint_to_be_evaluated>

# The name of the dataset (e.g. "rebel", "fewrel", "wiki_nre", "geo_nre")
DATASET_NAME=rebel  # rebel, fewrel, wiki_nre or geo_nre

# The constraints to be applied ("null" -> unconstrained, "small" or "large"; see the paper or the demo for details)
CONSTRAINTS=large

python run.py run_name=genie_r_rebel +evaluation=checkpoint_$CONSTRAINTS datamodule=$DATASET_NAME model.checkpoint_path=$PATH_TO_CKPT

To run the evaluation in a distributed fashion (e.g. with 4 GPUs on a single machine), add the options trainer=ddp trainer.gpus=4 to the call.
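Putting the pieces together, a distributed evaluation call might look as follows. This example simply combines the options described above (genie_r on REBEL with the large constraints, 4 GPUs on one machine):

```shell
# Evaluate the genie_r checkpoint on REBEL with the large constraints,
# distributed over 4 GPUs on a single machine.
python run.py run_name=genie_r_rebel +evaluation=checkpoint_large \
    datamodule=rebel model.checkpoint_path=$PATH_TO_CKPT \
    trainer=ddp trainer.gpus=4
```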

From here, to generate the plots and the bootstrapped results reported in the paper run python run.py +evaluation=results_full. See the configuration file for details.


License

This project is licensed under the terms of the MIT license.

genie's People

Contributors

martinj96, nicola-decao, saibo-creator


genie's Issues

`protobuf` package version mismatch

Error Message

TypeError: Descriptors cannot not be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
 1. Downgrade the protobuf package to 3.20.x or lower.
 2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).

In my case, the protobuf version installed by default was 4.22.0; downgrading the package to 3.20.x fixes the error.
We could also consider pinning the version in requirements.txt.
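For reference, the two workarounds suggested by the error message can be applied as follows (the exact patch version of the 3.20.x line is illustrative):

```shell
# Workaround 1: downgrade protobuf to a 3.20.x release
pip install "protobuf==3.20.3"

# Workaround 2: fall back to the pure-Python parser (much slower)
export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python
```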

Reproduction

  • Run setup.sh
  • Start the Demo.ipynb notebook
  • Run the block:
"""Load the Model"""
from genie.models import GeniePL

ckpt_name = "genie_r.ckpt"
path_to_checkpoint = os.path.join(DATA_DIR, 'models', ckpt_name)
model = GeniePL.load_from_checkpoint(checkpoint_path=path_to_checkpoint)

Fix

Explicitly pin the protobuf package version in requirements.txt: protobuf==3.20

There may be better ways to fix the error.

I could not find .utils.py or folder

Hi, I'm trying to use your code.
While working through your demo code, I came across the line from .utils import label_smoothed_nll_loss,
but I could not find that .utils module.
Where can I find it?
Thank you for reading my issue.

Broken link for custom prefix tree construction

Hello,

The link to the custom prefix tree construction cell seems to be broken in these two cells in demo.ipynb:
"To construct a prefix trie for your custom set of strings see this section."
and
"The last two examples illustrate how the generation for any of the GenIE models can be constrained with an arbitrary prefix trie. See how you can construct your custom prefix trie."

Is it possible to adapt genie to DBpedia?

Hi, I am working on a tool to translate text to RDF format under the DBpedia ontology. I think your tool is great and I would like to use it in my project, but I have seen that you use the wikidata ontology.

Do you think it is possible to use another ontology? What would be needed?
I believe that DBpedia Spotlight does something similar to what GENRE does.

Thanks in advance

Error with pip install -r pip_requirements.txt

Running bash setup.sh will raise the following error.

ERROR conda.cli.main_run:execute(32): Subprocess for 'conda run ['pip', 'install', '-r', 'pip_requirements.txt']' command failed.  (See above for error)
Collecting numpy==1.20.3
  Downloading numpy-1.20.3-cp38-cp38-macosx_10_9_x86_64.whl (16.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16.0/16.0 MB 11.6 MB/s eta 0:00:00
Collecting jsonlines==2.0.0
  Downloading jsonlines-2.0.0-py3-none-any.whl (6.3 kB)

ERROR: Could not find a version that satisfies the requirement pytorch==1.8.0 (from versions: 0.1.2, 1.0.2)
ERROR: No matching distribution found for pytorch==1.8.0

The error is raised because the PyPI package name for PyTorch is torch, not pytorch.

A simple fix is replacing pytorch==1.8.0 with torch==1.8.0 in pip_requirements.txt.

Unable to reproduce GenIE/notebooks/Demo.ipynb output

Hello, thank you for your valuable work. I found it very interesting!

I have tried to run the GenIE/notebooks/Demo.ipynb notebook and I found some mismatches with the provided outputs. I was wondering if you have any idea of why this is happening.


For instance, under the Unconstrained Generation subsection, I get the following output:

[[{'text': ' <sub> KSAZ <rel> headquarters location <obj> Phoenix, Arizona <et>', 'log_prob': -0.1369926631450653}, {'text': ' <sub> KSAZ-TV <rel> headquarters location <obj> Phoenix, Arizona <et>', 'log_prob': -0.200978085398674}]]

while the expected one is:

[[{'text': ' <sub> KTRK, Carson <rel> headquarters location <obj> Phoenix, Arizona <et>', 'log_prob': -0.19589225947856903}, {'text': ' <sub> KTRK, Carson <rel> located in the administrative territorial entity <obj> Arizona <et> <sub> KSAZ <rel> headquarters location <obj> Phoenix, Arizona <et>', 'log_prob': -0.2037668377161026}]]


The same behaviour occurs under the Constrained Generation subsection. For instance, the Small Schema Constrained Generation output is:

[[{'text': ' <sub> Arizona <rel> capital <obj> Phoenix, Arizona <et>', 'log_prob': -0.21632088720798492}, {'text': ' <sub> Phoenix, Arizona <rel> capital of <obj> Arizona <et> <sub> Arizona <rel> capital <obj> Phoenix, Arizona <et>', 'log_prob': -0.3067542612552643}]]

while the expected one is:

[[{'text': ' <sub> Fox Broadcasting Company <rel> located in the administrative territorial entity <obj> Arizona <et> <sub> Phoenix, Arizona <rel> capital of <obj> Arizona <et> <sub> Arizona <rel> capital <obj> Phoenix, Arizona <et>', 'log_prob': -0.43319371342658997}, {'text': ' <sub> Fox Broadcasting Company <rel> headquarters location <obj> Arizona <et> <sub> Phoenix, Arizona <rel> capital of <obj> Arizona <et> <sub> Arizona <rel> capital <obj> Phoenix, Arizona <et>', 'log_prob': -0.4518451988697052}]]


Similarly, the output under the Large Schema Constrained Generation is:

[[{'text': ' <sub> KSAZ <rel> headquarters location <obj> Phoenix, Arizona <et>', 'log_prob': -0.1369926631450653}, {'text': ' <sub> KSAZ-TV <rel> headquarters location <obj> Phoenix, Arizona <et>', 'log_prob': -0.200978085398674}]]

while the expected one is:

[[{'text': ' <sub> KTRK <rel> headquarters location <obj> Phoenix, Arizona <et> <sub> KSAZ-TV <rel> headquarters location <obj> Phoenix, Arizona <et>', 'log_prob': -0.22215303778648376}, {'text': ' <sub> KTRK <rel> headquarters location <obj> Phoenix, Arizona <et>', 'log_prob': -0.22950957715511322}]]


PS: I also had to downgrade torchmetrics to 0.6.0, as the default conda installation through the provided setup.sh script threw the following ImportError:

ImportError: cannot import name 'get_num_classes' from 'torchmetrics.utilities.data'
