UniParse - A framework for graph-based dependency parsing.

UniParse is a universal, modular framework for graph-based dependency parsing, built for quick prototyping and comparison of parsers and parser components. With this framework we provide a collection of helpful tools and implementations that assist the development of graph-based dependency parsers.

Installation

Since UniParse contains Cython code that depends on NumPy, installation must be done in two steps.

# (1)
pip install -r requirements.txt

# (2)
pip install [-e] .  # include -e if you'd like to be able to modify framework code

Neural Models

UniParse includes a small collection of state-of-the-art neural models implemented on top of a high-level model wrapper that should reduce development time significantly. The wrapper currently supports two neural backends, DyNet and PyTorch; one of these libraries is required to use it.

# uncomment desired (if any) backend
# pip install dynet>=2.1
# pip install torch>=1.0
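
If you are unsure which backend is available in your environment, a quick import check works. This is a minimal sketch; both packages remain optional:

# pick whichever backend is installed
try:
    import dynet  # noqa: F401
    BACKEND = "dynet"
except ImportError:
    import torch  # noqa: F401
    BACKEND = "torch"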

Components

Blazing-fast decoders

Algorithm           en_ud (s)   en_ptb (s)   sentences/s   % faster
CLE (Generic)       19.12       93.8         ~ 404         -
Eisner (Generic)    96.35       479.1        ~ 80          -
CLE (UniParse)      1.764       6.98         ~ 5436        1345%
Eisner (UniParse)   1.49        6.31         ~ 6009        7500%

import numpy as np
from uniparse.decoders import eisner, cle

# ones on the subdiagonal: score_matrix[head, dependent] favours head = dependent + 1
score_matrix = np.eye(10, k=-1)

eisner(score_matrix)
# > array([-1,  2,  3,  4,  5,  6,  7,  8,  9,  0])

cle(score_matrix)
# > array([-1,  2,  3,  4,  5,  6,  7,  8,  9,  0])
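
Both decoders take a dense score matrix and return one head per position, with index 0 reserved for the artificial root (whose head is -1). As a minimal sketch on random scores, assuming the score_matrix[head, dependent] convention of the example above:

import numpy as np
from uniparse.decoders import eisner, cle

rng = np.random.RandomState(42)
scores = rng.rand(11, 11)  # 10 tokens plus the artificial root at index 0

projective_heads = eisner(scores)    # Eisner: always produces a projective tree
unrestricted_heads = cle(scores)     # Chu-Liu/Edmonds: may be non-projective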

Evaluation

UniParse includes an evaluation script that works from within the framework, as well as by itself. For the former:

from uniparse.evaluate import conll17_eval  # wrapped UD evaluation script
from uniparse.evaluate import perl_eval  # wrapped CoNLL 2006/2007 perl script; ignores Unicode punctuation (used for SOTA reports)
from uniparse.evaluate import evaluate_files  # UniParse reimplementation; provides scores with and without punctuation


metrics1 = conll17_eval(test_file, gold_reference)
# > {"uas": ..., "las": ...}
metrics2 = perl_eval(test_file, gold_reference)
# > {"uas": ..., "las": ...}
metrics3 = evaluate_files(test_file, gold_reference)
# > {
#   "uas": ...,
#   "las": ...,
#   "nopunct_uas": ...,
#   "nopunct_las": ...,
# }

... and for the latter, copy the file uniparse/evaluate/uniparse_evaluate.py to a location of your choice and use it by running

python uniparse_evaluate.py --test [FILENAME.CONLLU] --gold [GOLD_REFERENCE.CONLLU]

Vocabulary

from uniparse import Vocabulary

vocab = Vocabulary().fit(CONLL_FILE, EMBEDDING_FILE)
data = vocab.tokenize_conll(TRAIN_FILE)
word_ids, lemma_ids, upos_ids, gold_arcs, gold_rels, chars_ids = data[0]
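
Each element of data corresponds to one sentence. A minimal sketch of iterating over the tokenized corpus, with fields as in the unpacking above:

for sentence in data:
    word_ids, lemma_ids, upos_ids, gold_arcs, gold_rels, chars_ids = sentence
    print(len(word_ids), "tokens; gold heads:", gold_arcs)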

Model Wrapper

vocab = ...  # as above
params = ...  # included or custom model
parser = Model(
    params,
    decoder="eisner",
    loss="kiperwasser",
    optimizer="adam",
    strategy="bucket",
    vocab=vocab,
)
parser.train(
    train_data,
    dev_filename,
    dev_data,
    epochs=epochs,
    batch_size=32,
    patience=3,
)

predictions = parser.run(test_file, test_data)
metrics = parser.evaluate(test_file, test_data)
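
Putting the pieces together, here is a hedged end-to-end sketch; the file names are placeholders, and the parameter object must come from one of the included models or your own implementation:

from uniparse import Vocabulary, Model

vocab = Vocabulary().fit("train.conllu", "embeddings.vec")
train_data = vocab.tokenize_conll("train.conllu")
dev_data = vocab.tokenize_conll("dev.conllu")
test_data = vocab.tokenize_conll("test.conllu")

params = ...  # e.g. one of the included models, parameterized with `vocab`
parser = Model(
    params,
    decoder="eisner",
    loss="kiperwasser",
    optimizer="adam",
    strategy="bucket",
    vocab=vocab,
)
parser.train(train_data, "dev.conllu", dev_data, epochs=30, batch_size=32, patience=3)
metrics = parser.evaluate("test.conllu", test_data)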

Batching

from uniparse.dataprovider import ScaledBatcher, BucketBatcher

dataprovider1 = BucketBatcher(data, padding_token=vocab.PAD)
idx, batches = dataprovider1.get_data(scale, shuffle=True)

dataprovider2 = ScaledBatcher(data, padding_token=vocab.PAD)
idx, batches = dataprovider2.get_data(shuffle=True)

Included models

With UniParse we include a set of state-of-the-art neural models composed entirely of UniParse components, along with their training scripts. We invite you to use these models as starting points and to freely rerun, modify, and extend them for further development or evaluation.

You'll find all the training scripts under /scripts.

# example 
python scripts/run_dynet_kiperwasser.py --train TRAIN_FILE --dev DEV_FILE --test TEST_FILE --model_dest MODEL --epochs 30

Model                          Language       UAS w.p.   LAS w.p.   UAS n.p.   LAS n.p.
Kiperwasser & Goldberg (2016)  Danish (UD)    83.18%     79.57%     83.67%     79.47%
                               English (UD)   87.06%     84.68%     88.08%     85.43%
                               English (PTB)  92.56%     91.17%     93.14%     91.57%
Dozat & Manning (2017)         Danish (UD)    87.42%     84.98%     87.84%     84.99%
                               English (UD)   90.74%     89.01%     91.47%     89.38%
                               English (PTB)  94.91%     93.70%     95.43%     94.06%
MST (non-neural)               Danish (UD)    67.17%     55.52%     68.80%     55.30%
                               English (UD)   73.47%     65.20%     75.55%     66.25%
                               English (PTB)  74.00%     63.60%     76.07%     64.67%

Here w.p. and n.p. denote 'with punctuation' and 'no punctuation', respectively. 'No punctuation' follows the rule of excluding modifier tokens that consist entirely of Unicode punctuation characters; this definition is standard in current research.

Note that these models must be trained. We are actively working on providing downloadable pretrained models.

PTB split

Since the splitting of Penn Treebank files is not fully standardised, we indicate the split used in the experiments from our paper, as well as in supporting literature. Note that published model performances for systems we re-implement and distribute with UniParse may use different splits, which has an observable impact on performance. In particular, we note that Dozat and Manning's parser performs differently under this split than reported in their paper.

Train Dev Test Discard
{02-21} {22} {23} {00}
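
For reference, a minimal sketch of assembling this split from a section-per-directory PTB layout (the directory structure and file extension are assumptions about your local copy):

import glob

train = sorted(f for sec in range(2, 22) for f in glob.glob(f"ptb/{sec:02d}/*.mrg"))
dev = sorted(glob.glob("ptb/22/*.mrg"))
test = sorted(glob.glob("ptb/23/*.mrg"))
# section 00 is discarded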

Citation

If you use UniParse, please cite our paper.

@inproceedings{varab-schluter-2019-uniparse,
    title = "{U}ni{P}arse: A universal graph-based parsing toolkit",
    author = "Varab, Daniel  and Schluter, Natalie",
    booktitle = "Proceedings of the 22nd Nordic Conference on Computational Linguistics",
    month = "30 " # sep # " {--} 2 " # oct,
    year = "2019",
    address = "Turku, Finland",
    publisher = {Link{\"o}ping University Electronic Press},
    url = "https://www.aclweb.org/anthology/W19-6149",
    pages = "406--410",
    abstract = "This paper describes the design and use of the graph-based parsing framework and toolkit UniParse, released as an open-source python software package. UniParse as a framework novelly streamlines research prototyping, development and evaluation of graph-based dependency parsing architectures. UniParse does this by enabling highly efficient, sufficiently independent, easily readable, and easily extensible implementations for all dependency parser components. We distribute the toolkit with ready-made configurations as re-implementations of all current state-of-the-art first-order graph-based parsers, including even more efficient Cython implementations of both encoders and decoders, as well as the required specialised loss functions.",
}

Issues

Pre-trained models

To make development and use of dependency parsers easier, we should train and host model parameters so that users don't have to train their own models to get started. This would be useful for trying out dependency parsers on new data, for using the annotations in some other task, or simply for producing baseline numbers straight out of the box.

  • Provide pre-trained models
  • Test

Support underscore heads to enable predictions on unannotated data

The Vocabulary class can't deal with UD samples that do not have integer heads, and the current workaround is putting some arbitrary integer in that column. It would be nice to have the vocabulary support "_" for dependency heads just like the other feature columns; this would enable parsing of unannotated data (still in CoNLL-U format). A sketch of the workaround appears after the list below.

  • Enable empty heads
  • Tests
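
Until then, a hedged sketch of the workaround described above, rewriting "_" heads to a dummy integer so that Vocabulary can read the file (column indices follow the CoNLL-U format):

def patch_unannotated_heads(conllu_in, conllu_out, dummy="0"):
    # CoNLL-U token lines have 10 tab-separated columns; HEAD is column 7 (index 6)
    with open(conllu_in) as fin, open(conllu_out, "w") as fout:
        for line in fin:
            cols = line.rstrip("\n").split("\t")
            if len(cols) == 10 and cols[6] == "_":
                cols[6] = dummy
            fout.write("\t".join(cols) + "\n")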

In-memory evaluation

Currently there is no support for in-memory evaluation. This causes several components to require reference .conllu files to work, which is undesirable.

TODOS:

  1. Implement in-memory evaluation.
  2. Remove dependency on reference files for training.
  3. Tests

Generalize dataprovider implementations

The implementations found in uniparse/dataprovider.py are too complex and don't generalize to arbitrarily shaped input. They are currently hardcoded to depend on the order and dimensionality of the inputs (e.g. which are 2D and which are 3D, and how they are ordered). This can be avoided entirely as long as the input is "stackable"; see the sketch after the list below.

  • Implement generalized versions of dataproviders
  • Test
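
As a hedged sketch of the "stackable" idea (the helper name is illustrative, not existing UniParse API): pad any list of equal-rank arrays to a common shape and stack them, independent of input order or dimensionality.

import numpy as np

def pad_stack(arrays, pad):
    # pad same-rank arrays to a shared shape, then stack into one batch tensor
    target = [max(a.shape[d] for a in arrays) for d in range(arrays[0].ndim)]
    padded = [
        np.pad(a, [(0, t - s) for s, t in zip(a.shape, target)], constant_values=pad)
        for a in arrays
    ]
    return np.stack(padded)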

exceptions must derive from BaseException

your error message is not an "exception":

Traceback (most recent call last):
  File "/home/robv/data_size/uniparse/uniparse/model.py", line 19, in <module>
    import uniparse.decoders as decoders
  File "/home/robv/data_size/uniparse/uniparse/decoders/__init__.py", line 2, in <module>
    from uniparse.decoders.eisner import Eisner as eisner
ModuleNotFoundError: No module named 'uniparse.decoders.eisner'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "run_pytorch_kiperwasser.py", line 6, in <module>
    from uniparse.callbacks import ModelSaveCallback
  File "/home/robv/data_size/uniparse/uniparse/__init__.py", line 4, in <module>
    from uniparse.model import Model
  File "/home/robv/data_size/uniparse/uniparse/model.py", line 21, in <module>
    raise ERROR_MSG

Coupled training and evaluation in scripts

Included scripts only support training, evaluation, and testing in one go. This is not convenient for real-world experiments, where we'd like to do these separately. In addition, the scripts don't support simply running a trained model on unlabelled data. These limitations should be fixed by refactoring the scripts into modular modes:

  1. train (with optional development set) to train and save a model (incl. vocab).
  2. evaluate to take a pre-trained model and predict and evaluate on labelled data.
  3. run to take a pre-trained model and produce predictions to a file.

NOTE: All input and output should be .conllu formatted files for the time being. For scenarios, such as run, dummy values should be allowed.
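
A minimal sketch of the proposed command-line interface (mode and flag names are illustrative):

import argparse

parser = argparse.ArgumentParser("uniparse-script")
modes = parser.add_subparsers(dest="mode", required=True)

train = modes.add_parser("train")  # train and save a model (incl. vocab)
train.add_argument("--train", required=True)
train.add_argument("--dev")  # optional development set
train.add_argument("--model_dest", required=True)

evaluate = modes.add_parser("evaluate")  # predict and evaluate on labelled data
evaluate.add_argument("--model", required=True)
evaluate.add_argument("--gold", required=True)

run = modes.add_parser("run")  # produce predictions to a file
run.add_argument("--model", required=True)
run.add_argument("--input", required=True)
run.add_argument("--output", required=True)

args = parser.parse_args()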

API Documentation

UniParse as a project is severely in need of documentation for all included components. Initially, this could be achieved through docstring documentation such as that seen in dynet (doc, docstrings).
