On the Summarization and Evaluation of Long Documents

This repository (forked from alexgaskell10/nlp_summarization) contains the code for my Imperial College London Master's thesis on text summarization. This file documents the steps required to finetune a model, generate summaries and then run evaluation on these predictions.

Prerequisites

Environment

Python >= 3.6 and a GPU with >= 12 GB of memory; tested on Linux only.

  1. Create and activate a new virtual environment
  2. cd to the root of this repository
  3. Run the following command to install packages: sh install_packages.sh

Data

  1. CNN/DailyMail: Download instructions here
  2. PubMed & arXiv: Download instructions here
  3. Quora Question Pairs: Download instructions here
  4. Annotated CNN/DailyMail dataset: Download instructions here

Check install has worked correctly

A demo script is provided in the example directory. cd there and run sh run_example.sh to run this pipeline. Instructions are included within this file.

Evaluation Analysis

Eval metrics

In this project we test the relative merits of a set of evaluation metrics; these will hereafter be referred to as the metrics.

Human-Metric Correlations

The relevant file is scripts/benchmarking/get_corrs.py, which is based heavily on this file. The objective is to compute the correlation between human judgements and the metrics' scores, using 2.5K human scores across 500 CNN/DailyMail summaries. The data for this task is in scripts/benchmarking/learned_eval/data. Example usage:

  • cd scripts/benchmarking/
  • python get_corrs.py -m ROUGE-1-F

The results in section 6.1.2 can be replicated by performing this for each evaluation metric.
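
For intuition, the computation at the heart of this step amounts to correlating two parallel lists of scores. The sketch below is a minimal illustration, assuming the human scores and one metric's scores have already been loaded into memory; the hypothetical correlate helper is not part of the repository, and get_corrs.py handles the real data loading and the full set of metrics:

```python
# Minimal sketch: correlate human judgements with a metric's scores.
# `human_scores` and `metric_scores` are assumed to be parallel lists, one
# entry per scored summary; the real data loading lives in get_corrs.py.
from scipy.stats import kendalltau, pearsonr, spearmanr

def correlate(human_scores, metric_scores):
    """Return the linear and rank correlations between the two score lists."""
    return {
        "pearson": pearsonr(human_scores, metric_scores)[0],
        "spearman": spearmanr(human_scores, metric_scores)[0],
        "kendall": kendalltau(human_scores, metric_scores)[0],
    }

# Example with dummy numbers:
print(correlate([0.2, 0.5, 0.9, 0.4], [0.25, 0.4, 0.8, 0.5]))
```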

Quora Question Pairs Analysis

The relevant file is scripts/benchmarking/qqp_corrs.py, which is also based heavily on this file. The objective is to test how well each metric can distinguish whether two sentences are semantically equivalent. The path to the data can be set in scripts/benchmarking/resources.py. Example usage:

  • cd scripts/benchmarking/
  • python qqp_corrs.py
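
For intuition, this task reduces to checking how well a metric's scores separate duplicate question pairs from non-duplicate ones. The sketch below is a minimal illustration with hypothetical inputs; qqp_corrs.py and resources.py handle the real data loading and metric computation:

```python
# Minimal sketch: how well does a metric separate semantically equivalent
# question pairs from non-equivalent ones?
# `labels` (1 = duplicate, 0 = not) and `scores` (the metric applied to each
# question pair) are assumed to be parallel lists; both are hypothetical here.
from sklearn.metrics import roc_auc_score

def separability(labels, scores):
    """AUC of the metric score when used to classify semantic equivalence."""
    return roc_auc_score(labels, scores)

# Example with dummy numbers: a useful metric scores the duplicate pairs higher.
print(separability([1, 0, 1, 0], [0.92, 0.35, 0.80, 0.55]))
```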

Adversarial Analysis

The relevant file is scripts/benchmarking/adversarial.py. The objective is to test which of the eval metrics are most sensitive to artificial corruption of the summaries. We performed this analysis using PEGASUS summaries, as these have been shown to be of human quality. This process involves 2 steps:

  1. Corrupt the summaries using 3 different sources of noise:
  • Random word dropping
  • Random word permutation
  • BERT mask-filling (mask random words then use BERT to in-fill these)
  2. Perform evaluation on PEGASUS's original and corrupted summaries, using the (human-produced) target summaries as the ground truth.

To perform step 1):

  • cd scripts/benchmarking/
  • Open adversarial.py and set the SUMMS_PATH (the path of the summaries to corrupt) and OUT_DIR (where the corrupted summaries should be saved)
  • Run python adversarial.py

To perform step 2):

  • Follow the instructions in Step 3 (Running evaluation) below to score each summary.
  • Use the AdversarialAnalyzer class in adversarial.py to compute the accuracy of each metric. You will have to manually add the paths to the original and corrupted summaries in this class (lines 145-158).
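
For intuition, the two random corruptions in step 1 amount to something like the sketch below. This is a simplified illustration, not the repository's implementation; adversarial.py is authoritative and also implements the BERT mask-filling corruption, which requires a pretrained masked language model:

```python
# Simplified sketch of the two random corruptions; adversarial.py is the
# authoritative implementation and additionally handles BERT mask-filling.
import random

def drop_words(summary: str, p: float = 0.1) -> str:
    """Randomly drop a fraction p of the words."""
    words = summary.split()
    kept = [w for w in words if random.random() > p]
    return " ".join(kept if kept else words)

def permute_words(summary: str, n_swaps: int = 3) -> str:
    """Randomly swap pairs of word positions n_swaps times."""
    words = summary.split()
    if len(words) < 2:
        return summary
    for _ in range(n_swaps):
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return " ".join(words)

original = "the cat sat on the mat and watched the rain fall"
print(drop_words(original))
print(permute_words(original))
```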

Architectures Analysis

Replicating LED Results

This is a multi-stage pipeline, as each step is GPU-intensive. The steps in the pipeline are as follows:

  1. Finetune a model
  2. Generate the test set predictions (summaries)
  3. Evaluate the test summaries

A demo script is provided in the example directory; cd there and run sh run_example.sh to run this pipeline.

Step 1. Finetune the LED

The main file is scripts/models/finetune.py. This is based heavily on the Huggingface script, which is a useful resource if you have any difficulties here. There are a number of shell scripts in sh_scripts/ which are configured for many of the tasks you will be interested in with this repo. To finetune the LED and replicate the results from the thesis, use the following steps:

  1. cd to scripts/models # TODO: check path
  2. Open sh_scripts/finetune_led.sh
  3. Change DATA_DIR and OUT_DIR to point to where the data is stored and where you want results to be saved to
  4. Run sh_scripts/finetune_led.sh

This script takes approximately 40 hours to run per epoch. This is because we used batch size = 1 to fit onto a 12 GB GPU; if you have larger hardware this can be sped up considerably. The best version of the model will be saved in $OUT_DIR/best_tfmr and will be used to generate the test set summaries in the next step.

Step 2. Generate test set summaries

Here you also use scripts/models/finetune.py. This step should be run after finetuning, as it requires a saved model in $OUT_DIR/best_tfmr.

  1. cd to scripts/models # TODO: check path
  2. Open sh_scripts/predict.sh
  3. Change DATA_DIR and OUT_DIR to point to where the data is stored and where you want results to be saved. $OUT_DIR should point to the location of the saved model you want to use to generate predictions
  4. Run sh_scripts/predict.sh

This will generate 2 files within $OUT_DIR: test_generations.txt and test_targets.txt. These contain the generated summaries from the test set and the target summaries respectively. These files will be used for the evaluation stage.
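
As a quick sanity check of these two files you can score a few line-aligned pairs directly, for example with the rouge-score package as sketched below. This is only an illustration (with a placeholder path) and is not necessarily how benchmark.py computes its scores; the full evaluation is performed in Step 3:

```python
# Quick sanity check of the generated files (illustrative only; the full
# evaluation across all metrics is performed by benchmark.py in Step 3).
from rouge_score import rouge_scorer

OUT_DIR = "path/to/out_dir"  # placeholder: set to your $OUT_DIR

with open(f"{OUT_DIR}/test_generations.txt") as f:
    generations = [line.strip() for line in f]
with open(f"{OUT_DIR}/test_targets.txt") as f:
    targets = [line.strip() for line in f]

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
for target, generation in zip(targets[:3], generations[:3]):
    scores = scorer.score(target, generation)
    print({name: round(score.fmeasure, 3) for name, score in scores.items()})
```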

Step 3. Running evaluation

Here you use scripts/benchmarking/benchmark.py. The purpose of this step is to generate evaluation scores using each of the eval metrics.

Process:

  1. cd to scripts/benchmarking
  2. Open sh_scripts/run_eval.sh
  3. Change SUMS_DIR and OUTDIR appropriately. OUTDIR is where the eval output will be saved. SUMS_DIR is the directory containing the generated summaries along with the targets; these should be named according to Step 2 above.
  4. Run sh_scripts/run_eval.sh

This will save the output to $OUTDIR/eval_output_raw.txt and $OUTDIR/eval_output.txt. The first contains the scores for each metric per summary; the second contains only the means and standard deviations. These are saved as a single line per directory; if you perform eval for a different model, its scores will be appended as a new line.
