This project extends DyGIE++, a general IE framework, to handle document-level IE tasks. Specifically, it establishes benchmarks for end-to-end document-level relation extraction on three well-known datasets: DocRED, CDR, and GDA. A demo of this project can be found here.
- Dependencies
- Model training
- Model evaluation
- Pretrained models
- Making predictions on existing datasets
- Contact
See the `doc` folder for documentation with more details on the data, model implementation and debugging, and model configuration.
Clone this repository and navigate to the root of the repo on your system. Then execute:
```shell
conda create --name dygiepp python=3.7
pip install -r requirements.txt
conda develop .  # Adds DyGIE to your PYTHONPATH
```
Like DyGIE++, this codebase relies on AllenNLP and uses AllenNLP shell commands for training models and making predictions.
If you run into an issue installing `jsonnet`, this issue may prove helpful.
The training procedure is exactly the same as in the original DyGIE++; please refer to this section for details. In short, to train a model, simply run:

```shell
bash scripts/train.sh CONFIG
```

where `CONFIG.jsonnet` is the configuration file defined under the `training_config` directory.
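For reference, a config under `training_config` might look roughly like the sketch below. The overall shape follows the DyGIE++ config template (`template.libsonnet`); the exact field values and data paths shown here are hypothetical and may differ in this repo.

```jsonnet
local template = import "template.libsonnet";

template.DyGIE {
  bert_model: "bert-base-cased",
  cuda_device: 0,
  data_paths: {
    // Hypothetical paths; match them to your processed data.
    train: "data/docred/processed-data/train.json",
    validation: "data/docred/processed-data/dev.json",
    test: "data/docred/processed-data/test.json",
  },
  loss_weights: {
    ner: 0.5,
    relation: 1.0,
    coref: 1.0,
    events: 0.0,
  },
  target_task: "relation",
}
```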
The original DocRED dataset did not release the test set annotations. Therefore, we split the dev set into `devdev` and `devtest` sets. The corresponding doc keys for `devdev` and `devtest` are defined in `scripts/data/docred/dev-dev_dockey.list` and `scripts/data/docred/dev-test_dockey.list`.
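The split above amounts to filtering the dev documents by doc key. A minimal sketch of that filtering step, assuming the DyGIE++ convention of a `doc_key` field per document (the actual preprocessing script may differ):

```python
def split_by_dockeys(docs, dockeys):
    """Keep only the documents whose doc_key appears in the given key list."""
    keys = set(dockeys)
    return [doc for doc in docs if doc["doc_key"] in keys]

# Toy example: keep the documents whose keys are in the devdev list.
docs = [{"doc_key": "a"}, {"doc_key": "b"}, {"doc_key": "c"}]
devdev_keys = ["a", "c"]
print([d["doc_key"] for d in split_by_dockeys(docs, devdev_keys)])
# → ['a', 'c']
```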
- Download the data. From the top-level folder for this repo, run `bash ./scripts/data/get_docred.sh`.
- Preprocess the data. Run `bash ./scripts/data/docred/process_docred.sh`. This will produce mention-level `data*.json` files for training DyGIE++ as well as entity-level `data*.entlvl.json` files for final evaluation in `data/docred/processed-data`.
- Train the model. Run `bash scripts/train.sh docred`. The default encoder is `bert-base-cased`, but you can easily replace `bert-base-cased` with `roberta-base` for slightly better performance.
The data pre-processing and model training for CDR and GDA are mostly the same, and both are very similar to DocRED's. Below shows how the CDR data is pre-processed and how a model can be trained on it.
- Download the data. From the top-level folder for this repo, run `bash ./scripts/data/get_cdr.sh`.
- Train the model. Run `bash scripts/train.sh cdr`.
- As with SciERC, we also offer a "lightweight" version with a context width of 1 and no coreference propagation.
To evaluate sentence-level IE tasks, such as NER, relation extraction, and event extraction, please refer to how DyGIE++ evaluates a model. End-to-end document-level relation extraction is currently implemented as a separate evaluator in `eval.py`. You need to pass the entity-level predictions from the model and the entity-level gold data to the script as follows:

```shell
python eval.py --pred-file PATH_TO_PREDICTION_FILE.entlvl.json --gold-file PATH_TO_GOLD_FILE.entlvl.json
```

Note that passing mention-level predictions (the output of `allennlp predict`) to the evaluator will not work.
Below describes how to make predictions.
A number of models are available for download. They are named for the dataset they are trained on. "Lightweight" models are models trained on datasets for which coreference resolution annotations were available, but we didn't use them. This is "lightweight" because coreference resolution is expensive, since it requires predicting cross-sentence relationships between spans.
If you want to use one of these pretrained models to make predictions on a new dataset, you need to set the `dataset` field for the instances in your new dataset to match the name of the dataset the model was trained on. For example, to make predictions using the pretrained SciERC model, set the `dataset` field in your new instances to `scierc`. For more information on the `dataset` field, see data.md.
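One way to do this is a small relabeling pass over the JSON-lines input. The sketch below assumes the DyGIE++ convention of one JSON document per line with a top-level `dataset` field; the helper name is our own:

```python
import json

def set_dataset_field(in_path, out_path, dataset_name):
    """Rewrite a DyGIE++-format JSON-lines file, setting each instance's
    `dataset` field to the name of the dataset the model was trained on."""
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            doc = json.loads(line)
            doc["dataset"] = dataset_name
            fout.write(json.dumps(doc) + "\n")
```

For example, `set_dataset_field("my_data.jsonl", "my_data.scierc.jsonl", "scierc")` would prepare `my_data.jsonl` for the pretrained SciERC model.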
To download all available models, run `scripts/pretrained/get_dygiepp_pretrained.sh`, or click on the links below to download only a single model. Below are links to the available models, followed by the name of the `dataset` each model was trained on.
| Dataset | Entity F1 | Relation F1 |
|---|---|---|
| DocRED | 87.56 | 51.15 |
| CDR | 79.07 | 50.90 |
| GDA | 83.63 | 76.88 |
To make mention-level predictions, you can use `allennlp predict`; details are described in the original DyGIE++ repo. Mention-level relation predictions can be gathered into entity-level relation predictions with `postprocess.py`, which aggregates the mention-level relations between every pair of entities by voting. A typical usage is:

```shell
python postprocess.py --input-path predictions/cdr.test.json --output-path predictions/cdr.test.entlvl.json
```

where `predictions/cdr.test.json` is the output of the `allennlp predict` command.
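The voting step can be pictured as follows: all mention-pair predictions between mentions of the same two entities are pooled, and the majority label wins. This is a simplified sketch of the assumed behavior; `postprocess.py` may break ties or filter labels differently.

```python
from collections import Counter

def aggregate_by_voting(mention_relations):
    """mention_relations: iterable of (head_entity_id, tail_entity_id, label),
    one item per mention-pair prediction. Returns the majority label for each
    entity pair."""
    votes = {}
    for head, tail, label in mention_relations:
        votes.setdefault((head, tail), Counter())[label] += 1
    return {pair: counts.most_common(1)[0][0] for pair, counts in votes.items()}

# Three mention pairs link E1 and E2; "treats" wins the vote 2-1.
print(aggregate_by_voting([
    ("E1", "E2", "treats"),
    ("E1", "E2", "treats"),
    ("E1", "E2", "causes"),
]))
# → {('E1', 'E2'): 'treats'}
```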
For questions or problems with the code, create a GitHub issue (preferred) or email [email protected].