SciTLDR

This repository contains the dataset, model weights, and generation code for our paper "TLDR: Extreme Summarization of Scientific Documents".

Demo

A running demo of our model can be found here.

Dataset

SciTLDR is split into train/dev/test sets in a 60/20/20 ratio. In each file, every line is a JSON object formatted as follows:

{
   "source":[
      "sent0",
      "sent1",
      "sent2",
      ...
   ],
   "source_labels":[binary list in which 1 is the oracle sentence],
   "rouge_scores":[precomputed rouge-1 scores],
   "paper_id":"PAPER-ID",
   "target":[
      "author-tldr",
      "pr-tldr0",
      "pr-tldr1",
      ...
   ],
   "title":"TITLE"
}

The keys rouge_scores and source_labels are not required by any of the code, but we provide the precomputed ROUGE scores to encourage future research.
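As an illustration of the format above, here is a minimal sketch of reading one of the data files and recovering the oracle sentences. The helper names read_jsonl and oracle_sentences are hypothetical and not part of this repository; the sketch assumes only the keys shown above and that source_labels, when present, lines up one-to-one with source.

```python
import json
from typing import Dict, Iterator, List


def read_jsonl(path: str) -> Iterator[Dict]:
    """Yield one parsed record per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines rather than failing on them
                yield json.loads(line)


def oracle_sentences(record: Dict) -> List[str]:
    """Return the source sentences whose source_labels entry is 1.

    Hypothetical helper: assumes source_labels is aligned with source.
    Returns an empty list if the record has no source_labels key.
    """
    labels = record.get("source_labels", [])
    return [sent for sent, label in zip(record["source"], labels) if label == 1]
```

For example, iterating over read_jsonl("SciTLDR-Data/SciTLDR-A/test.jsonl") and calling oracle_sentences on each record should give the oracle sentence(s) for each paper, while record["target"] holds the reference TLDRs.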

Requirements

We use Fairseq to train and evaluate our models. Install Fairseq as follows:

git clone fairseq repo #TODO figure out how to use specific version
cd fairseq
pip install --editable .

To install all other requirements, run pip install -r requirements.txt

For the evaluation, you will need files2rouge. Please follow the installation instructions here.

Model Weights

catts.tldr-ao

catts.tldr-aic

catts-xsum.tldr-ao

catts-xsum.tldr-aic

bart.tldr-ao

bart.tldr-aic

bart-xsum.tldr-ao

bart-xsum.tldr-aic

Data Preprocessing

In order to format the data to work for the Fairseq library, run:

cd SciTLDR-Data
export TASK=SciTLDR-A # Choose from {A, AIC, FullText}
chmod +x make_datafiles.sh
./make_datafiles.sh # BPE preprocess

$TASK/ctrl contains the dataset formatted with the control codes.

Generation

This code takes in a test.source file, in which each line is one input, and writes a test.hypo file containing the predictions. See decoder_params for the optimal decoder parameters for each version of the model.

python scripts/generate.py /path/to/modeldir/ SciTLDR-Data/SciTLDR-A/ctrl ./ --beam 2 --lenpen 0.4 --test_fname test.hypo

Evaluation

This script is a wrapper around ROUGE that takes in a test.hypo file and compares it against a test.jsonl file.

python scripts/cal-rouge.py /path/to/test.hypo SciTLDR-Data/SciTLDR-A/test.jsonl --workers 1
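Before running the evaluation, it can be useful to check that the two files line up. The sketch below is not part of scripts/cal-rouge.py. It assumes that line i of test.hypo is the prediction for the record on line i of test.jsonl, which this README does not state explicitly, and the function name pair_predictions is hypothetical.

```python
import json
from typing import List, Tuple


def pair_predictions(hypo_path: str, jsonl_path: str) -> List[Tuple[str, List[str]]]:
    """Pair each prediction with the "target" list on the same line number.

    Assumption: predictions and records are aligned line by line.
    """
    with open(hypo_path, encoding="utf-8") as handle:
        # Keep empty predictions so that line numbers stay aligned.
        hypotheses = [line.rstrip("\n") for line in handle]
    with open(jsonl_path, encoding="utf-8") as handle:
        targets = [json.loads(line)["target"] for line in handle if line.strip()]
    if len(hypotheses) != len(targets):
        raise ValueError(
            f"{len(hypotheses)} predictions but {len(targets)} reference records"
        )
    return list(zip(hypotheses, targets))
```

Each returned pair holds one generated TLDR and all of its reference TLDRs, so a length mismatch is caught before any scores are computed.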

Citing

If you use our code, dataset, or model weights in your research, please cite "TLDR: Extreme Summarization of Scientific Documents."

@article{cachola2020tldr,
  title={{TLDR}: Extreme Summarization of Scientific Documents},
  author={Isabel Cachola and Kyle Lo and Arman Cohan and Daniel S. Weld},
  journal={arXiv:2004.15011},
  year={2020},
}

SciTLDR is an open-source project developed by the Allen Institute for Artificial Intelligence (AI2). AI2 is a non-profit institute with the mission to contribute to humanity through high-impact AI research and engineering.
