
This project is forked from l-zhe/corpg.



License: MIT License



CoRPG

Code for the paper Document-Level Paraphrase Generation with Sentence Rewriting and Reordering by Zhe Lin, Yitao Cai and Xiaojun Wan, accepted to Findings of EMNLP 2021.

(model overview figure)

Datasets

We leverage ParaNMT to train a sentence-level paraphrasing model. We select News-Commentary as the document corpus, use the sentence-level paraphrasing model to generate pseudo document-level paraphrases, and use ALBERT to generate their coherence relationship graphs. All these data are released here.

System Output

If you are looking for the system output and would rather not install dependencies and train a model (or run a pre-trained one), the all-system-output folder is for you.

Dependencies

PyTorch >= 1.4

Transformers == 4.1.1

nltk == 3.5

tqdm

torch_optimizer == 0.1.0
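The pinned versions above can be installed in one step; a possible command, assuming the package names match their PyPI names (`torch_optimizer` is published as `torch-optimizer`, which pip normalizes automatically):

```shell
pip install "torch>=1.4" "transformers==4.1.1" "nltk==3.5" tqdm "torch_optimizer==0.1.0"
```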

Train a Document-Level Paraphrase Model

Step1: Prepare dataset

We release the dataset we used in the data folder. If you want to use your own dataset, follow the procedure below.

First, train a sentence-level paraphrase model to generate a pseudo document-level paraphrase dataset (we leverage ParaNMT and fairseq to train this model).

Then, download the ALBERT model and fine-tune it on your own dataset with the following script:

python eval/coherence.py --train
			 --pretrain_model [pretrain_model file]
			 --save_file [the path to save fine-tune model]
			 --text_file [the corpora used to fine-tune the pretrain_model] 

We also provide our fine-tuned model here.

Finally, you can leverage ALBERT to generate the coherence relationship graph:

python eval/coherence.py --inference
			 --pretrain_model [pretrain_model file]
			 --text_file [generate the coherence relationship graph of this corpora]

NOTE: Our code only supports generating paraphrases of documents with 5 sentences. If you want to paraphrase longer or variable-length documents, you need to modify the code.
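Because the released code handles fixed 5-sentence documents, longer corpora must be segmented into 5-sentence chunks before running the steps above. A minimal sketch of that segmentation (the `chunk_document` helper is ours, not part of the repo):

```python
def chunk_document(sentences, size=5):
    """Group a list of sentences into fixed-size pseudo-documents.

    A trailing partial chunk is dropped, matching the fixed-length
    assumption of the released code.
    """
    return [sentences[i:i + size]
            for i in range(0, len(sentences) - size + 1, size)]
```

For example, a 12-sentence article yields two 5-sentence documents, and the last two sentences are discarded.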

Step2: Process dataset

Create Vocabulary:

python createVocab.py --file ./data/news-commentary/data/bpe/train.split.bpe \
                             ./data/news-commentary/data/bpe/train.pesu.split.bpe\
                      --save_path ./data/vocab.share
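The shared vocabulary presumably maps every BPE token appearing in the two training files to an integer id. A rough illustration of the idea (our own sketch, not the actual `createVocab.py` logic):

```python
from collections import Counter

def build_vocab(lines, specials=("<pad>", "<unk>", "<bos>", "<eos>")):
    """Count whitespace-separated BPE tokens and assign ids by frequency,
    reserving the first ids for special tokens."""
    counts = Counter(tok for line in lines for tok in line.split())
    vocab = {tok: i for i, tok in enumerate(specials)}
    for tok, _ in counts.most_common():
        vocab[tok] = len(vocab)
    return vocab
```

Both the pseudo-paraphrase side and the original side feed the same counter, which is what makes the vocabulary "shared".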

Processing Training Dataset:

python preprocess.py --source ./data/news-commentary/data/bpe/train.pesu.comb.bpe \
                     --graph ./data/news-commentary/data/train.pesu.graph \
                     --target ./data/news-commentary/data/bpe/train.comb.bpe \
                     --vocab ./data/vocab.share \
                     --save_file ./data/para.pair

Processing Test Dataset:

python preprocess.py --source ./data/news-commentary/data/bpe/test.comb.bpe \
                     --graph ./data/news-commentary/data/test.graph \
                     --vocab ./data/vocab.share \
                     --save_file ./data/sent.pt

Step3: Train a document-level paraphrase model

python train.py --cuda_num 0 1 2 3\
                --vocab ./data/vocab.share\
                --train_file ./data/para.pair\
                --checkpoint_path ./data/model \
                --batch_print_info 200 \
                --grad_accum 1 \
                --graph_eps 0.5 \
                --max_tokens 5000
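`--max_tokens`-style batching (as in fairseq) groups examples so that each padded batch stays under a token budget instead of using a fixed batch size; `--grad_accum` then accumulates gradients over several such batches. A simplified sketch of the packing idea (not the repo's actual batching code):

```python
def token_batches(lengths, max_tokens=5000):
    """Greedily pack example indices into batches whose padded size
    (number of examples x longest example) stays within max_tokens."""
    batches, cur, longest = [], [], 0
    for i, n in enumerate(lengths):
        new_longest = max(longest, n)
        if cur and new_longest * (len(cur) + 1) > max_tokens:
            batches.append(cur)
            cur, new_longest = [], n
        cur.append(i)
        longest = new_longest
    if cur:
        batches.append(cur)
    return batches
```

Packing by tokens keeps GPU memory usage roughly constant even when document lengths vary widely.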

Step4: Generate document-level paraphrase

python generator.py --cuda_num 4 \
                    --file ./data/sent.pt\
                    --ref_file ./data/news-commentary/data/test.comb \
                    --max_tokens 10000 \
                    --vocab ./data/vocab.share \
                    --decode_method greedy \
                    --beam 5 \
                    --model_path ./data/model.pkl \
                    --output_path ./data/output \
                    --max_length 300
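`--decode_method` selects the search strategy; with `greedy`, the `--beam` value is presumably ignored. Greedy decoding simply takes the argmax token at each step until an end-of-sequence token or `--max_length` is reached; a toy sketch (the `step` callback is a stand-in for the model's next-token scorer):

```python
def greedy_decode(step, bos=0, eos=1, max_length=300):
    """step(tokens) -> list of next-token scores; pick the argmax each step."""
    tokens = [bos]
    for _ in range(max_length):
        scores = step(tokens)
        nxt = max(range(len(scores)), key=scores.__getitem__)
        tokens.append(nxt)
        if nxt == eos:
            break
    return tokens
```

Beam search would instead keep the `--beam` highest-scoring partial sequences at every step, trading speed for output quality.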

Pre-trained Models

We release our pretrained model at here.

Evaluation Metrics

We evaluate our model in three aspects: relevancy, diversity, and coherence.

Relevancy

We leverage BERTScore to evaluate the semantic relevancy between the paraphrase and the original sentence.

Diversity

We employ self-TER and self-WER to evaluate the diversity of our model.
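Self-TER and self-WER score the paraphrase against its own source, so a higher error rate indicates more surface-level rewriting. A minimal word-level edit-distance sketch of WER (our own illustration, not the evaluation script used in the paper):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance divided by
    the reference length, computed with a rolling DP row."""
    r, h = reference.split(), hypothesis.split()
    d = list(range(len(h) + 1))          # DP row for the empty reference
    for i in range(1, len(r) + 1):
        prev, d[0] = d[0], i             # prev holds dist[i-1][j-1]
        for j in range(1, len(h) + 1):
            cur = d[j]                   # dist[i-1][j]
            d[j] = min(d[j] + 1,         # deletion
                       d[j - 1] + 1,     # insertion
                       prev + (r[i - 1] != h[j - 1]))  # substitution
            prev = cur
    return d[len(h)] / len(r)
```

TER additionally allows block shifts of whole phrases, which is why it complements plain WER for measuring reordering.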

Coherence

We propose COH and COH-p to evaluate the coherence of paraphrases as follows:

(COH / COH-p formula image)

where $$P_{SOP}$$ is calculated by ALBERT. We provide the script for these two evaluation metrics as follows:

python eval/coherence.py --coh
			 --pretrain_model [the pretrain albert file]
			 --text_file [the corpora to be evaluated]

Results

(results table image)

Citation

If you use any content of this repo for your work, please cite the following bib entry:

@inproceedings{lin-wan-2021-document,
    title = "Document-Level Paraphrase Generation with Sentence Rewriting and Reordering",
    author = "Lin, Zhe and Cai, Yitao and Wan, Xiaojun",
    booktitle = "Findings of EMNLP",
    year = "2021"
}
