
DocGNRE

This repo contains the code used for the EMNLP 2023 paper "Semi-automatic Data Enhancement for Document-Level Relation Extraction with Distant Supervision from Large Language Models".

Requirements

  • Python 3.8
  • Python packages
    • PyTorch 2.0+
    • transformers 4.24.0
    • openai
    • tqdm
    • wandb
    • pandas

Datasets

Original DocRED and Re-DocRED

The DocRED and Re-DocRED datasets can be downloaded by following the instructions at the corresponding links.

DocGNRE

Our enhanced dataset can be obtained with the following command:

wget https://bigai-nlco.s3.ap-southeast-1.amazonaws.com/DocGNRE/enhancement_data.zip

We provide an enhanced test set (in "enhancement_data/re_docred_test_data_enhancement.json") after manual refinement, along with four training sets enhanced by our distant annotations.

Automatic Relation Generation

You can generate the distantly enhanced datasets with a single command:

bash Automatical_Relation_Generation/run.sh

GPT Results as Proposals

STEP1
The Automatical_Relation_Generation/I_gpt_proposal.py script generates additional triples for each document in the original dataset.
This code calls OpenAI's model APIs; access requires an API key, which you can obtain by creating an account on the OpenAI website.
Example (run inside Automatical_Relation_Generation):

python I_gpt_proposal.py -i ${input_file_path} -o ${output_file_path}

Arguments:

  • -i, --input: Path to the input file, such as the train file path of Re-DocRED.
  • -o, --output: Path to the output file.

The Automatical_Relation_Generation/I_gpt_proposal_more.py script generates additional triples iteratively, feeding the previous GPT answers back as input.
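The prompt construction in step 1 can be sketched as follows. This is a hypothetical illustration of the idea (document text plus entity list, asking for triples), not the exact prompt used by the script:

```python
# Hypothetical sketch of building a GPT proposal prompt from a DocRED-style
# document; the exact wording used by I_gpt_proposal.py may differ.
def build_proposal_prompt(doc):
    text = " ".join(word for sent in doc["sents"] for word in sent)
    entities = sorted({m["name"] for mentions in doc["vertexSet"] for m in mentions})
    return (
        "Document:\n" + text + "\n\n"
        "Entities: " + ", ".join(entities) + "\n"
        "List relation triples (head entity, relation, tail entity) "
        "expressed in the document between the given entities."
    )

doc = {
    "sents": [["Culrav", "is", "a", "cultural", "festival", "."]],
    "vertexSet": [[{"name": "Culrav", "sent_id": 0, "pos": [0, 1], "type": "MISC"}]],
}
print(build_proposal_prompt(doc))
```

The resulting prompt would then be sent to the OpenAI chat completion API using your API key.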

STEP2
The Automatical_Relation_Generation/II_gpt_triples_postprocess.py script filters out undesired or malformed triples.
Example to run:

python II_gpt_triples_postprocess.py -i ${step1_output_file_path} -o ${output_file_path}

Arguments:

  • -i, --input: Path to the input file.
  • -o, --output: Path to the output file.
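The kind of filtering step 2 performs can be sketched as below, assuming triples are dicts with h/r/t entity indices as in the data format; the concrete rules live in II_gpt_triples_postprocess.py and may differ:

```python
def filter_triples(triples, num_entities, valid_relations):
    """Drop triples that are self-referential, reference a non-existent
    entity index, or use a relation outside the predefined set."""
    kept = []
    for t in triples:
        if t["h"] == t["t"]:
            continue  # self-loop
        if not (0 <= t["h"] < num_entities and 0 <= t["t"] < num_entities):
            continue  # entity index out of range
        if t["r"] not in valid_relations:
            continue  # not a predefined DocRED relation ID
        kept.append(t)
    return kept

triples = [
    {"h": 0, "r": "P17", "t": 3},
    {"h": 1, "r": "P17", "t": 1},   # self-loop, dropped
    {"h": 0, "r": "BAD", "t": 2},   # unknown relation, dropped
]
print(filter_triples(triples, num_entities=4, valid_relations={"P17", "P131"}))
```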

NLI as an Annotator

STEP3
The Automatical_Relation_Generation/III_nli_annotator.py script calculates entailment scores for the predefined relation types.
Example to run:

python III_nli_annotator.py -i ${step2_output_file_path} -o ${output_file_path}

Arguments:

  • -i, --input: Path to the input file.
  • -o, --output: Path to the output file.
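The NLI annotation in step 3 might look like the sketch below: each candidate relation is verbalized into a hypothesis and scored against the document as the premise. The relation verbalizations and the pluggable `entail_fn` are our assumptions, not the script's actual choices:

```python
def annotate_pair(premise, head, tail, relation_names, entail_fn):
    """Score every predefined relation as an NLI hypothesis for one
    (head, tail) entity pair; entail_fn(premise, hypothesis) -> float."""
    scores = {}
    for rel_id, rel_name in relation_names.items():
        hypothesis = f"{head} {rel_name} {tail}."
        scores[rel_id] = entail_fn(premise, hypothesis)
    return scores

# A real run would plug in an NLI model (e.g. via transformers'
# text-classification pipeline); a toy scorer keeps the sketch runnable.
toy_entail = lambda premise, hyp: 1.0 if hyp.rstrip(".") in premise else 0.0
scores = annotate_pair(
    "Culrav is a festival of MNNIT. Culrav is held in Allahabad",
    "Culrav", "Allahabad",
    {"P276": "is held in", "P17": "is in the country"},
    toy_entail,
)
print(scores)
```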

STEP4
The Automatical_Relation_Generation/IV_nli_score_postprocess.py script adds relations whose entailment scores are high enough, ensuring the quality of the newly added triples.
Example to run:

python IV_nli_score_postprocess.py -origin ${step1_input_file_path} -i ${step3_output_file_path} -o ${output_file_path}

Arguments:

  • -origin, --origin: Path to the original file.
  • -i, --input: Path to the input file.
  • -o, --output: Path to the output file.
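Step 4 can be sketched as a thresholding pass over the scored triples; the 0.5 threshold is purely illustrative, not the script's actual cutoff:

```python
def supplement_labels(original_labels, scored_triples, threshold=0.5):
    """Keep scored GPT triples above the threshold that are not already
    present in the original annotations. Threshold is illustrative."""
    existing = {(l["h"], l["r"], l["t"]) for l in original_labels}
    return [
        t for t in scored_triples
        if t["score"] >= threshold and (t["h"], t["r"], t["t"]) not in existing
    ]

original = [{"r": "P17", "h": 0, "t": 3, "evidence": [0]}]
proposals = [
    {"h": 0, "r": "P127", "t": 1, "score": 0.758},  # kept
    {"h": 0, "r": "P276", "t": 4, "score": 0.41},   # below threshold
    {"h": 0, "r": "P17", "t": 3, "score": 0.9},     # already annotated
]
print(supplement_labels(original, proposals))
```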

DocRE Models

Training

The codebase of this repo extends DREEAM. Since this work focuses on an automated annotation method, model training and evaluation are essentially unchanged: simply point the training-set path in DocRE/scripts/run_bert_gpt.sh and DocRE/scripts/run_roberta_gpt.sh at an enhanced dataset.
Run:

bash scripts/run_bert_gpt.sh ${name} ${lambda} ${seed} # for BERT
bash scripts/run_roberta_gpt.sh ${name} ${lambda} ${seed} # for RoBERTa

where ${name} is the run identifier displayed in wandb, ${lambda} is the scalar weighting the evidence loss, and ${seed} is the random seed.

Evaluation

Make predictions on the enhanced test set with the commands below:

bash DocRE/scripts/isf_bert.sh ${name} ${model_dir} ${test_file_path} # for BERT
bash DocRE/scripts/isf_roberta.sh ${name} ${model_dir} ${test_file_path} # for RoBERTa

where ${model_dir} is the directory containing the checkpoint to evaluate. The script writes a result.json file in the official evaluation format.

Data Format

Generated input example:

{
  "title": "Culrav", 
  "sents": [
      ["Culrav", "is", "a", "cultural", "festival", "of", "Motilal", "Nehru", "National", ...], 
      ["Culrav", "gives", "a", "platform", "to", "the", "students", "of", "MNNIT", ...], 
      ...],
  "vertexSet": [
      [
        {
          "sent_id": 0, 
          "type": "MISC", 
          "pos": [0, 1], 
          "name": "Culrav", 
          "global_pos": [0, 0], 
          "index": "0_0"
        }, 
        {
          "sent_id": 1, 
          "type": "MISC", 
          "pos": [0, 1], 
          "name": "Culrav", 
          "global_pos": [15, 15], 
          "index": "0_1"
        },
        ...],
      ...],
  "labels": [
      {
        "r": "P17",
        "h": 0, 
        "t": 3, 
        "evidence": [0, 1, 3]
      }, 
      {
        "r": "P131",
        "h": 1, 
        "t": 2, 
        "evidence": [0]
      },
      ...],
  "gpt_labels": [
      {
        "h": 0,
        "r": "P127",
        "t": 1, 
        "score": 0.758
      }, 
      {
        "h": 0,
        "r": "P276", 
        "t": 4, 
        "score": 0.662
      },
      ...]
}
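Given the format above, the distant annotations can be read back with a few lines of standard-library Python. The helper name is ours, and we take each entity's first mention as its display name:

```python
import json  # documents are stored as JSON, as shown above

def gpt_triples_with_names(doc):
    """Resolve entity indices in `gpt_labels` to mention names."""
    names = [mentions[0]["name"] for mentions in doc["vertexSet"]]
    return [(names[l["h"]], l["r"], names[l["t"]], l["score"])
            for l in doc.get("gpt_labels", [])]

doc = {
    "vertexSet": [[{"name": "Culrav"}], [{"name": "MNNIT"}]],
    "gpt_labels": [{"h": 0, "r": "P127", "t": 1, "score": 0.758}],
}
print(gpt_triples_with_names(doc))
# For a whole file: doc = json.load(open("re_docred_test_data_enhancement.json"))
```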

Citation

@inproceedings{docgnre,
    title = "Semi-automatic Data Enhancement for Document-Level Relation Extraction with Distant Supervision from Large Language Models",
    author = "Li, Junpeng and Jia, Zixia and Zheng, Zilong",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2023",
    publisher = "Association for Computational Linguistics"
}

Acknowledgements

The codebase of this repo is extended from DREEAM.


