ELSA combines extractive and abstractive approaches to automatic text summarization.

Home Page: http://elsa-nlp.com


ELSA: Extractive Linking of Summarization Approaches

Authors: Maksim Eremeev, Mars Wei-Lun Huang, Eric Spector, Jeffrey Tumminia

Installation

python setup.py build
pip install .

Quick Start with ELSA

from elsa import Elsa

article = '''some text...
'''

abstractive_model_params = {
    'num_beams': 10,
    'max_length': 300,
    'min_length': 55,
    'no_repeat_ngram_size': 3
}

elsa = Elsa(
    weights=[1, 1],
    abstractive_base_model='bart',
    base_dataset='cnn',
    stopwords='data/stopwords.txt',
    fasttext_model_path='datasets/cnn/elsa-fasttext-cnn.bin',
    udpipe_model_path='data/english-ewt-ud-2.5-191206.udpipe',
)

elsa.summarize(article, **abstractive_model_params)

__init__ parameters

  • weights: List[float] -- weights for TextRank and Centroid extractive summarizations.
  • abstractive_base_model: str -- model used on the abstractive step. Either 'bart' or 'pegasus'.
  • base_dataset: str -- dataset used to train the abstractive model. Either 'cnn' or 'xsum'.
  • stopwords: str -- path to the list of stopwords.
  • fasttext_model_path: str -- path to the *.bin checkpoint of a trained FastText model (see below for the training instructions).
  • udpipe_model_path: str -- path to the *.udpipe checkpoint of the pretrained UDPipe model (see data directory for the files).
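
The weights argument takes one value per extractive scorer (TextRank and Centroid). This README does not describe how ELSA applies them internally, so the sketch below only illustrates one plausible reading: a plain weighted sum of per-sentence scores. The combine_scores helper and the sample scores are hypothetical and are not part of the elsa package.

```python
from typing import List, Sequence


def combine_scores(textrank: Sequence[float],
                   centroid: Sequence[float],
                   weights: Sequence[float]) -> List[float]:
    """Weighted sum of two per-sentence score lists.

    Assumption: a linear combination. ELSA's real scoring may differ.
    """
    if len(textrank) != len(centroid):
        raise ValueError("score lists must have the same length")
    if len(weights) != 2:
        raise ValueError("expected exactly two weights")
    w_textrank, w_centroid = weights
    return [w_textrank * t + w_centroid * c
            for t, c in zip(textrank, centroid)]


# Made-up scores for a three-sentence article.
scores = combine_scores([0.5, 0.25, 0.0], [0.0, 0.5, 1.0], weights=[1, 1])
print(scores)
```

Under this reading, weights=[1, 1] treats both scorers equally, while something like weights=[2, 1] would favor TextRank.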

summarize parameters

  • factor: float -- fraction (a number from 0 to 1) of sentences to keep in the extractive summary (default: 0.5)

  • use_lemm: bool -- whether to use lemmatization on the preprocessing step (default: False)

  • use_stem: bool -- whether to use stemming on the preprocessing step (default: False)

  • check_stopwords: bool -- whether to filter stopwords on the preprocessing step (default: True)

  • check_length: bool -- whether to filter out tokens shorter than 4 characters (default: True)

  • abstractive_model_params: dict -- any parameters for the Hugging Face model's generate method (the Quick Start passes them as keyword arguments)
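
To make the preprocessing flags and factor more concrete, here is a minimal sketch of what they describe. It is not ELSA's implementation: filter_tokens and sentences_to_keep are hypothetical helpers, the sample stopword set is made up, and rounding up the sentence count is an assumption.

```python
import math
from typing import Iterable, List, Set


def filter_tokens(tokens: Iterable[str], stopwords: Set[str],
                  check_stopwords: bool = True,
                  check_length: bool = True) -> List[str]:
    """Drop stopwords and tokens shorter than 4 characters."""
    kept = []
    for token in tokens:
        if check_stopwords and token.lower() in stopwords:
            continue
        if check_length and len(token) < 4:
            continue
        kept.append(token)
    return kept


def sentences_to_keep(n_sentences: int, factor: float = 0.5) -> int:
    """Number of sentences kept for a given factor (rounding up is assumed)."""
    if not 0 <= factor <= 1:
        raise ValueError("factor must be between 0 and 1")
    return math.ceil(n_sentences * factor)


tokens = ["The", "cat", "quickly", "jumped", "over", "walls"]
print(filter_tokens(tokens, stopwords={"the", "over"}))
print(sentences_to_keep(10, factor=0.5))
```

With both filters on, "The" and "over" are removed as stopwords and "cat" is removed for its length. Setting check_length=False would keep "cat".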

Datasets used for experiments

CNN-DailyMail: Link, original source: Link

XSum: Link, original source: Link

Gazeta.RU: Link, original source: Link

Downloading & Extracting datasets

wget https://s3.amazonaws.com/opennmt-models/Summary/cnndm.tar.gz
wget http://bollin.inf.ed.ac.uk/public/direct/XSUM-EMNLP18-Summary-Data-Original.tar.gz
wget https://www.dropbox.com/s/cmpfvzxdknkeal4/gazeta_jsonl.tar.gz

tar -xzf cnndm.tar.gz
tar -xzf XSUM-EMNLP18-Summary-Data-Original.tar.gz
tar -xzf gazeta_jsonl.tar.gz

FastText models

Our trained FastText models

CNN-DailyMail: Link

XSum: Link

Gazeta: Link

See our FastText page for training details.

UDPipe models

UDPipe models available for English:

  • UDPipe-English EWT: Link (Used in our experiments, see data directory)
  • UDPipe-English ParTUT: Link
  • UDPipe-English LinES: Link
  • UDPipe-English GUM: Link

Other UDPipe models: Link

Adaptation for Russian

Because the approach behind ELSA is language-independent, it can be adapted to other languages. For Russian, we fine-tune mBART on the Gazeta dataset, train an additional FastText model, and use a UDPipe model built for Russian texts.

UDPipe models for Russian

  • UDPipe-Russian SynTagRus: Link
  • UDPipe-Russian GSD: Link (Used in our experiments, see data directory)
  • UDPipe-Russian Taiga: Link

mBART checkpoint

HuggingFace checkpoint: Link

Codestyle check

Before committing or opening a pull request, please check the coding style by running the bash script in the codestyle directory. Make sure that your folder is included in the codestyle/pycodestyle_files.txt list.

Your changes will not be approved if the script reports any style violations (this does not apply to 3rd-party code).

Usage:

cd codestyle
sh check_code_style.sh
