
Paraphrase Types for Generation and Detection

arXiv · HuggingFace Spaces · HuggingFace Datasets

(Teaser figure)

The Repository

This repository implements the EMNLP'23 paper "Paraphrase Types for Generation and Detection".

Demo

A demonstration of Paraphrase Type Generation with an interactive chat window is available on HuggingFace Spaces.

Data

The preprocessed ETPC dataset with paraphrase types can be found on HuggingFace Datasets. Data card and loading scripts are under etpc/.
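
For example, the dataset can be loaded with the datasets library. This is a minimal sketch; the hub id jpwahle/etpc is an assumption, so check the HuggingFace Datasets page linked above for the exact identifier:

from datasets import load_dataset

# Load the preprocessed ETPC dataset with paraphrase type annotations.
# The hub id "jpwahle/etpc" is assumed here; verify it on the dataset page.
etpc = load_dataset("jpwahle/etpc")
print(etpc)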

Fine-Tuning

Fine-tune generation models

You can use the src/finetune_generation.py script to train the generation models. Here is an example of how to use it:

python3 src/finetune_generation.py --model_name <model_name> --task_name <task_name> --device <device>

Replace <model_name>, <task_name>, and <device> with your specific values.

  • <model_name>: The name of the pre-trained model on HF.
  • <task_name>: Paraphrase Type Generation or regular Paraphrase Generation.
  • <device>: cuda, cpu, or mps (for Apple Silicon).

For more details on the parameters, refer to the script src/finetune_generation.py.
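
For example, a hypothetical invocation (the model name and task string below are placeholders; check the script's argument parser for the exact accepted values):

python3 src/finetune_generation.py --model_name facebook/bart-base --task_name "Paraphrase Type Generation" --device cuda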

Fine-tune detection models

You can use the src/finetune_detection.py script to train the detection models. Here is an example of how to use it:

python3 src/finetune_detection.py --model_name <model_name> --task_name <task_name> --device <device>

Replace <model_name>, <task_name>, and <device> with your specific values.

  • <model_name>: The name of the pre-trained model on HF.
  • <task_name>: Paraphrase Type Detection or regular Paraphrase Detection.
  • <device>: cuda, cpu, or mps (for Apple Silicon).

For more details on the parameters, refer to the script src/finetune_detection.py.
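
For example, again with placeholder values (consult the script for the exact task names):

python3 src/finetune_detection.py --model_name bert-base-uncased --task_name "Paraphrase Type Detection" --device cuda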

Slurm

If you are using a slurm cluster for managing resources, see slurm_cls.sh and slurm_gen.sh.
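
On most clusters, these scripts are submitted with sbatch (adjust partitions and paths inside the scripts to your setup first):

sbatch slurm_gen.sh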

Prompt-based learning with LLMs

To generate prompts for both type generation and detection, execute src/generate_prompts_etpc.py. This creates four files: detection_train.jsonl, detection_test.jsonl, generation_train.jsonl, and generation_test.jsonl, used for training and testing detection and generation, respectively. You can generate prompts for QQP analogously using src/generate_prompts_qqp.py.
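
Each of these files is in JSON Lines format, so a quick inspection takes only a few lines of Python. This sketch assumes nothing beyond one JSON object per line:

import json

# Read the generated examples (one JSON object per line).
with open("generation_train.jsonl") as f:
    examples = [json.loads(line) for line in f]

print(len(examples), "training examples")
print(examples[0])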

LLaMA

Update 16-12-2023: We have now also fine-tuned LLaMA 2 models (with PEFT / LORA adapters), which can be found below.

Model   | Params | Dataset | Task | Link
--------|--------|---------|------|--------------
LLaMA 2 | 7B     | ETPC    | PTG  | llama-7b-etpc
LLaMA 2 | 13B    | ETPC    | PTG  | llama-13b-etpc
LLaMA 2 | 70B    | ETPC    | PTG  | llama-70b-etpc
LLaMA 2 | 7B     | QQP     | PD   | llama-7b-qqp
LLaMA 2 | 13B    | QQP     | PD   | llama-13b-qqp
LLaMA 2 | 70B    | QQP     | PD   | llama-70b-qqp

PTG = Paraphrase Type Generation, PD = Paraphrase Detection
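
The adapters can also be loaded on top of a Hugging Face LLaMA 2 checkpoint with peft. A minimal sketch, assuming the adapters are hosted under the author's namespace (e.g. jpwahle/llama-7b-etpc) and that you have access to the gated meta-llama checkpoint:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Base checkpoint (gated; requires approved access on the Hub).
base_id = "meta-llama/Llama-2-7b-hf"
base = AutoModelForCausalLM.from_pretrained(base_id)
tokenizer = AutoTokenizer.from_pretrained(base_id)

# The adapter id is an assumption; use the link from the table above.
model = PeftModel.from_pretrained(base, "jpwahle/llama-7b-etpc")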

To run LLaMA, execute src/llama_generation.py or src/llama_detection.py.

python3 -m torch.distributed.run --nproc_per_node 8 src/llama_generation.py --ckpt_dir <ckpt_dir> --tokenizer_path <tokenizer_path> --data_file <data_file>

Replace <ckpt_dir>, <tokenizer_path>, and <data_file> with your specific values.

  • <ckpt_dir>: The directory where the model checkpoints are stored after downloading from the LLaMA repo.
  • <tokenizer_path>: The path to the tokenizer used by the model.
  • <data_file>: The file containing prompts and completions.
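
A concrete call might look like this (all paths below are placeholders for illustration):

python3 -m torch.distributed.run --nproc_per_node 8 src/llama_generation.py --ckpt_dir llama-2-70b/ --tokenizer_path tokenizer.model --data_file generation_test.jsonl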

For running LLaMA with slurm, use slurm_llama_gen.sh and slurm_llama_cls.sh.

To fine-tune LLaMA, follow the instructions here. You can load the fine-tuned model via <ckpt_dir> to compare against the prompted model. With src/llama_transfer.py, you can test the prompted and fine-tuned models on other paraphrase tasks (e.g., PAWS).

ChatGPT

To fine-tune ChatGPT-3.5, execute src/finetune_chatgpt.py. Specify either the detection_train.jsonl or generation_train.jsonl file that was generated using the generate_prompts_* scripts.
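
Under the hood, fine-tuning reduces to uploading the file and starting a job. Below is a minimal sketch with the openai Python package (v1+), assuming the jsonl already matches the API's expected fine-tuning format; the repository's script handles this for you:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the prompt file produced by the generate_prompts_* scripts.
upload = client.files.create(
    file=open("generation_train.jsonl", "rb"),
    purpose="fine-tune",
)

# Start a fine-tuning job on GPT-3.5 with the uploaded file.
job = client.fine_tuning.jobs.create(
    training_file=upload.id,
    model="gpt-3.5-turbo",
)
print(job.id)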

To evaluate the fine-tuned model on paraphrase type generation and detection, run src/eval_type_detection_chatgpt.py and src/eval_generation_chatgpt.py, providing the <model_id> of the fine-tuned model and a <data_file>, which can be generation_test.jsonl or detection_test.jsonl.

To evaluate on QQP, run src/eval_detection_chatgpt.py and use src/eval_generation_chatgpt.py with the other generated prompt files.

Contributing

There are many ways in which you can participate in this project, for example, by reporting issues or submitting pull requests.

Citation

@inproceedings{wahle-etal-2023-paraphrase,
    title = "Paraphrase Types for Generation and Detection",
    author = "Wahle, Jan Philip  and
      Gipp, Bela  and
      Ruas, Terry",
    editor = "Bouamor, Houda  and
      Pino, Juan  and
      Bali, Kalika",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.emnlp-main.746",
    doi = "10.18653/v1/2023.emnlp-main.746",
    pages = "12148--12164",
    abstract = "Current approaches in paraphrase generation and detection heavily rely on a single general similarity score, ignoring the intricate linguistic properties of language. This paper introduces two new tasks to address this shortcoming by considering paraphrase types - specific linguistic perturbations at particular text positions. We name these tasks Paraphrase Type Generation and Paraphrase Type Detection. Our results suggest that while current techniques perform well in a binary classification scenario, i.e., paraphrased or not, the inclusion of fine-grained paraphrase types poses a significant challenge. While most approaches are good at generating and detecting general semantic similar content, they fail to understand the intrinsic linguistic variables they manipulate. Models trained in generating and identifying paraphrase types also show improvements in tasks without them. In addition, scaling these models further improves their ability to understand paraphrase types. We believe paraphrase types can unlock a new paradigm for developing paraphrase models and solving tasks in the future.",
}

If you use the ETPC dataset, please also cite:

@inproceedings{kovatchev-etal-2018-etpc,
    title = "{ETPC} - A Paraphrase Identification Corpus Annotated with Extended Paraphrase Typology and Negation",
    author = "Kovatchev, Venelin  and
      Mart{\'\i}, M. Ant{\`o}nia  and
      Salam{\'o}, Maria",
    booktitle = "Proceedings of the Eleventh International Conference on Language Resources and Evaluation ({LREC} 2018)",
    month = may,
    year = "2018",
    address = "Miyazaki, Japan",
    publisher = "European Language Resources Association (ELRA)",
    url = "https://aclanthology.org/L18-1221",
}

License

Licensed under the Apache 2.0 license. Parts of the code under src/llama are licensed under the LLaMA Community License Agreement.


Issues

not working as expected

from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftConfig
import torch
import os

cwd = os.getcwd()
model_id = "daryl149/llama-2-7b-chat-hf"
peft_model_id = os.path.join(cwd, "llama-7b-qqp")

print("Starting..")
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Read the access token from the environment; never hard-code it.
model = AutoModelForCausalLM.from_pretrained(model_id, token=os.environ.get("HF_TOKEN"))

# Attach the PEFT/LoRA adapter before generating.
peft_config = PeftConfig.from_pretrained(peft_model_id)
model.add_adapter(peft_config)
model.enable_adapters()

def generate_paraphrases(text, num_return_sequences=3):
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        num_return_sequences=num_return_sequences,
        max_length=100,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        temperature=1.2,
    )
    return tokenizer.batch_decode(outputs)

test = "Here are two questions (Question1 and Question2). If these questions have the same meaning and same answer, answer 'Yes', otherwise 'No'.\nQuestion1: 2+3?, Question2: 3+2"

generate_paraphrases("[INST]Use the given question to guide your summary about the context. If you don't know the answer, just say that you don't know, don't try to make up an answer. Question: Summarize Topic X Context: When talking about Topic X, Scenario Y is always referred to. This is due to the relation of Topic X is a broad topic which covers many aspects of life. No one knows when Topic X became a thing, its origin is unknown even to this day.[/INST]")

generate_paraphrases(test, 1)

Am I missing something?
