On the Summarization and Evaluation of Long Documents

This repository (forked from alexgaskell10/nlp_summarization) contains the code for my Imperial College London Master's thesis on text summarization. This file documents the steps required to finetune a model, generate summaries and then run evaluation on these predictions.

Prerequisites

Environment

Python >= 3.6 and a GPU with >= 12 GB of memory; tested on Linux only.

  1. Create and activate a new virtual environment
  2. cd to the root of this repository
  3. Run the following command to install packages: sh install_packages.sh

Data

  1. CNN/DailyMail: Download instructions here
  2. PubMed & arXiv: Download instructions here
  3. Quora Question Pairs: Download instructions here
  4. Annotated CNN/DailyMail dataset: Download instructions here

Check install has worked correctly

A demo script is provided in the example directory. cd there and run sh run_example.sh to run this pipeline. Instructions are included within this file.

Evaluation Analysis

Eval metrics

In this project we test the relative merits of a set of evaluation metrics; these will hereafter be referred to as the metrics.

Human-Metric Correlations

The relevant file is scripts/benchmarking/get_corrs.py, which is based heavily on this file. The objective is to compute the correlation between human judgements and the metrics' scores, using 2.5K human scores across 500 CNN/DailyMail summaries. The data for this task is in scripts/benchmarking/learned_eval/data. Example usage:

  • cd scripts/benchmarking/
  • python get_corrs.py -m ROUGE-1-F

The results in section 6.1.2 can be replicated by performing this for each evaluation metric.
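
For intuition, the computation at the heart of this step amounts to correlating two parallel lists of scores. The sketch below is a minimal illustration, assuming the human scores and one metric's scores have already been loaded into memory; the hypothetical correlate helper is not part of the repository, and get_corrs.py handles the real data loading and the full set of metrics:

```python
# Minimal sketch: correlate human judgements with a metric's scores.
# `human_scores` and `metric_scores` are assumed to be parallel lists, one
# entry per scored summary; the real data loading lives in get_corrs.py.
from scipy.stats import kendalltau, pearsonr, spearmanr

def correlate(human_scores, metric_scores):
    """Return the linear and rank correlations between the two score lists."""
    return {
        "pearson": pearsonr(human_scores, metric_scores)[0],
        "spearman": spearmanr(human_scores, metric_scores)[0],
        "kendall": kendalltau(human_scores, metric_scores)[0],
    }

# Example with dummy numbers:
print(correlate([0.2, 0.5, 0.9, 0.4], [0.25, 0.4, 0.8, 0.5]))
```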

Quora Question Pairs Analysis

The relevant file is scripts/benchmarking/qqp_corrs.py, which is also based heavily on this file. The objective is to test how well each metric can distinguish whether two sentences are semantically equivalent. The path to the data can be set in scripts/benchmarking/resources.py. Example usage:

  • cd scripts/benchmarking/
  • python qqp_corrs.py
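
For intuition, this task reduces to checking how well a metric's scores separate duplicate question pairs from non-duplicate ones. The sketch below is a minimal illustration with hypothetical inputs; qqp_corrs.py and resources.py handle the real data loading and metric computation:

```python
# Minimal sketch: how well does a metric separate semantically equivalent
# question pairs from non-equivalent ones?
# `labels` (1 = duplicate, 0 = not) and `scores` (the metric applied to each
# question pair) are assumed to be parallel lists; both are hypothetical here.
from sklearn.metrics import roc_auc_score

def separability(labels, scores):
    """AUC of the metric score when used to classify semantic equivalence."""
    return roc_auc_score(labels, scores)

# Example with dummy numbers: a useful metric scores the duplicate pairs higher.
print(separability([1, 0, 1, 0], [0.92, 0.35, 0.80, 0.55]))
```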

Adversarial Analysis

The relevant file is scripts/benchmarking/adversarial.py. The objective is to test which of the eval metrics are most sensitive to artificial corruption of the summaries. We performed this analysis using PEGASUS summaries, as these have been shown to be of human quality. This process involves 2 steps:

  1. Corrupt the summaries using 3 different sources of noise:
  • Random word dropping
  • Random word permutation
  • BERT mask-filling (mask random words then use BERT to in-fill these)
  2. Perform evaluation on PEGASUS's original and corrupted summaries, using the (human-produced) target summaries as the ground truth.

To perform step 1):

  • cd scripts/benchmarking/
  • Open adversarial.py and set the SUMMS_PATH (the path of the summaries to corrupt) and OUT_DIR (where the corrupted summaries should be saved)
  • Run python adversarial.py

To perform step 2):

  • Follow the instructions in Step 3 (Running evaluation) below to score each summary.
  • Use the AdversarialAnalyzer class in adversarial.py to compute the accuracy of each metric. You will have to manually add the paths to the original and corrupted summaries in this class (lines 145-158).
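
For intuition, the two random corruptions in step 1 amount to something like the sketch below. This is a simplified illustration, not the repository's implementation; adversarial.py is authoritative and also implements the BERT mask-filling corruption, which requires a pretrained masked language model:

```python
# Simplified sketch of the two random corruptions; adversarial.py is the
# authoritative implementation and additionally handles BERT mask-filling.
import random

def drop_words(summary: str, p: float = 0.1) -> str:
    """Randomly drop a fraction p of the words."""
    words = summary.split()
    kept = [w for w in words if random.random() > p]
    return " ".join(kept if kept else words)

def permute_words(summary: str, n_swaps: int = 3) -> str:
    """Randomly swap pairs of word positions n_swaps times."""
    words = summary.split()
    if len(words) < 2:
        return summary
    for _ in range(n_swaps):
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return " ".join(words)

original = "the cat sat on the mat and watched the rain fall"
print(drop_words(original))
print(permute_words(original))
```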

Architectures Analysis

Replicating LED Results

This is a multi-stage pipeline, as each step is GPU-intensive. The steps in the pipeline are as follows:

  1. Finetune a model
  2. Generate the test set predictions (summaries)
  3. Evaluate the test summaries

A demo script is provided in the example directory; cd there and run sh run_example.sh to run this pipeline.

Step 1. Finetune the LED

The main file is scripts/models/finetune.py. This is based heavily on the Huggingface script, which is a useful resource if you have any difficulties here. There are a number of shell scripts in sh_scripts/ which are configured for many of the tasks you will be interested in with this repo. To finetune the LED and replicate the results from the thesis, use the following steps:

  1. cd to scripts/models # TODO: check path
  2. Open sh_scripts/finetune_led.sh
  3. Change DATA_DIR and OUT_DIR to point to where the data is stored and where you want results to be saved to
  4. Run sh_scripts/finetune_led.sh

This script takes approximately 40 hours to run per epoch. This is because we used batch size = 1 to fit onto a 12 GB GPU; if you have larger hardware this can be sped up considerably. The best version of the model will be saved in $OUT_DIR/best_tfmr and will be used to generate the test set summaries in the next step.

Step 2. Generate test set summaries

Here you also use scripts/models/finetune.py. This step should be run after finetuning, as it requires a saved model in $OUT_DIR/best_tfmr.

  1. cd to scripts/models # TODO: check path
  2. Open sh_scripts/predict.sh
  3. Change DATA_DIR and OUT_DIR to point to where the data is stored and where you want results to be saved. $OUT_DIR should point to the location of the saved model you want to use to generate predictions
  4. Run sh_scripts/predict.sh

This will generate 2 files within $OUT_DIR: test_generations.txt and test_targets.txt. These contain the generated summaries from the test set and the target summaries respectively. These files will be used for the evaluation stage.
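
As a quick sanity check of these two files you can score a few line-aligned pairs directly, for example with the rouge-score package as sketched below. This is only an illustration (with a placeholder path) and is not necessarily how benchmark.py computes its scores; the full evaluation is performed in Step 3:

```python
# Quick sanity check of the generated files (illustrative only; the full
# evaluation across all metrics is performed by benchmark.py in Step 3).
from rouge_score import rouge_scorer

OUT_DIR = "path/to/out_dir"  # placeholder: set to your $OUT_DIR

with open(f"{OUT_DIR}/test_generations.txt") as f:
    generations = [line.strip() for line in f]
with open(f"{OUT_DIR}/test_targets.txt") as f:
    targets = [line.strip() for line in f]

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
for target, generation in zip(targets[:3], generations[:3]):
    scores = scorer.score(target, generation)
    print({name: round(score.fmeasure, 3) for name, score in scores.items()})
```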

Step 3. Running evaluation

Here you use scripts/benchmarking/benchmark.py. The purpose of this step is to generate evaluation scores using each of the eval metrics.

Process:

  1. cd to scripts/benchmarking
  2. Open sh_scripts/run_eval.sh
  3. Change SUMS_DIR and OUTDIR appropriately. OUTDIR is where the eval output will be saved. SUMS_DIR is the directory containing the generated summaries along with the targets; these should be named according to Step 2 above.
  4. Run sh_scripts/run_eval.sh

This will save the output to $OUTDIR/eval_output_raw.txt and $OUTDIR/eval_output.txt. The first contains the scores for each metric per summary; the second contains only the means and standard deviations. These are saved as a single line per directory; if you perform eval for a different model, its scores will be appended as a new line.
