Giter Site home page Giter Site logo

summarus's Introduction

summarus

Build Status Code Climate Gitter

Abstractive and extractive summarization models, mostly for Russian language. Building on top of AllenNLP

Based on the following papers:

Contacts

Prerequisites

pip install -r requirements.txt

Commands

train.sh

Script for training a model based on AllenNLP 'train' command.

Argument Required Description
-c true path to file with configuration
-s true path to directory where model will be saved
-t true path to train dataset
-v true path to val dataset
-r false recover from checkpoint

predict.sh

Script for model evaluation. The test dataset should have the same format as the train dataset.

Argument Required Default Description
-t true path to test dataset
-m true path to tar.gz archive with model
-p true name of Predictor
-c false 0 CUDA device
-L true Language ("ru" or "en")
-b false 32 size of a batch with test examples to run simultaneously
-M false path to meteor.jar for Meteor metric
-T false tokenize gold and predicted summaries before metrics calculation
-D false save temporary files with gold and predicted summaries

summarus.util.train_subword_model

Script for subword model training.

Argument Default Description
--train-path path to train dataset
--model-path path to directory where generated subword model will be saved
--model-type bpe type of subword model, see sentencepiece
--vocab-size 50000 size of the resulting subword model vocabulary
--config-path path to file with configuration for DatasetReader (with parse_set)

Headline generation

Dataset splits:

Models:

Prediction script:

./predict.sh -t <path_to_test_dataset> -m ria_pgn_24kk.tar.gz -p subwords_summary -L ru 

Results:

Train dataset: RIA, test dataset: RIA
Model R-1-f R-2-f R-L-f BLEU
ria_copynet_10kk 40.0 23.3 37.5 52.6
ria_pgn_24kk 42.3 25.1 39.6 54.2
ria_mbart 42.8 25.5 39.9 55.1
First Sentence 24.1 10.6 16.7 -

Train dataset: RIA, eval dataset: Lenta

Model R-1-f R-2-f R-L-f BLEU
ria_copynet_10kk 25.6 12.3 23.0 36.1
ria_pgn_24kk 26.4 12.3 24.0 39.8
ria_mbart 30.3 14.5 27.1 43.2
First Sentence 25.5 11.2 19.2 25.5

Summarization - CNN/DailyMail

Dataset splits:

Models:

Prediction script:

./predict.sh -t <path_to_test_dataset> -m cnndm_pgn_25kk.tar.gz -p words_summary -L en -R

Results:

Model R-1-f R-2-f R-L-f METEOR BLEU
cnndm_pgn_25kk 38.5 16.5 33.4 17.6 47.7

Summarization - Gazeta, russian news dataset

Models:

Prediction scripts:

./predict.sh -t <path_to_test_dataset> -m gazeta_pgn_7kk.tar.gz -p subwords_summary -L ru -T
./predict.sh -t <path_to_test_dataset> -m gazeta_summarunner_3kk.tar.gz -p subwords_summary_sentences -L ru -T

External models:

Results:

Model R-1-f R-2-f R-L-f METEOR BLEU
gazeta_pgn_7kk 29.4 12.7 24.6 21.2 38.8
gazeta_pgn_7kk_cov 29.8 12.8 25.4 22.1 40.8
gazeta_pgn_25kk 29.6 12.8 24.6 21.5 39
gazeta_pgn_words_13kk 29.4 12.6 24.4 20.9 35.9
gazeta_summarunner_3kk 31.6 13.7 27.1 26.0 46.3
gazeta_mbart 32.6 14.6 28.2 25.7 49.8
gazeta_mbart_lower 32.7 14.7 28.3 25.8 48.7

summarus's People

Contributors

ilyagusev avatar

Watchers

James Cloos avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.