ISO-Standard-DA

A resource to create an ISO-compliant Dialogue Act Tagger using publicly available data.

If you use our code, please cite our COLING 2018 paper "ISO-Standard Domain-Independent Dialogue Act Tagging for Conversational Agents":

Mezza, S., Cervone, A., Stepanov, E., Tortoreto, G., & Riccardi, G. (2018, August). ISO-Standard Domain-Independent Dialogue Act Tagging for Conversational Agents. In Proceedings of the 27th International Conference on Computational Linguistics (pp. 3539-3551).

Getting started

This repository contains a set of utilities for training and testing a fully functional Dialogue Act Tagger compliant with the new ISO DA scheme. The software uses only publicly available datasets to train its models. However, some of these resources may require authorization before use. Please check that you have all the required data before using this code. Information on how to obtain the corpora is available on their official websites:

For the Switchboard Dialogue Act Corpus: https://web.stanford.edu/~jurafsky/swb1_dialogact_annot.tar.gz
For Oasis BT: http://groups.inf.ed.ac.uk/oasis/ (please note that Oasis BT's license has expired and thus to our knowledge the corpus is no longer available)
For Maptask: http://groups.inf.ed.ac.uk/maptask/
For VerbMobil2: https://www.phonetik.uni-muenchen.de/Bas/BasVM2eng.html
For AMI: http://groups.inf.ed.ac.uk/ami/corpus/

Each corpus folder has a data subfolder, which must contain the unzipped parent folder of the corresponding resource. The file DialogueActTrain.py contains a small main method which can be used to train a Dialogue Act Tagger. If you do not have access to some of the resources, you can comment out the corresponding line to exclude that resource from training. The trained DA tagger is then stored in the models folder. You can use the DialogueActTagger class, defined in the Python file of the same name, to test and use it. The code was written for Python 3 and requires spaCy 1.8 and the latest version of scikit-learn to run.

The Corpus class and the corpora mappers

The main component of the code architecture is the Corpus interface and its extensions, which handle the conversion of each corpus's original annotation scheme to the ISO standard. The interface is built around four main steps:

Loading (usually handled in the constructor)
Converting to CSV (load_csv method). This is necessary because some of these resources were annotated in a rather inconvenient XML format that is hard for a human annotator to read.
Mapping to ISO standard (create_csv method)
Dumping (a separate method was implemented for each ISO dimension, plus one to just output the original corpus annotation in CSV).
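The four steps above can be sketched as a small abstract base class. This is only an illustration under stated assumptions: load_csv and create_csv are the method names given in this README, but their signatures, the dump_csv name, the ToyCorpus class and its tag mapping are hypothetical and are not taken from the repository.

```python
from abc import ABC, abstractmethod
import csv
import io


class Corpus(ABC):
    """Hypothetical base class mirroring the four steps listed above."""

    def __init__(self, corpus_folder):
        # Step 1: loading is handled in the constructor.
        self.corpus_folder = corpus_folder
        self.csv_corpus = self.load_csv()

    @abstractmethod
    def load_csv(self):
        """Step 2: read the original resource and return a list of row tuples."""

    @abstractmethod
    def create_csv(self):
        """Step 3: return the rows with their DA tags mapped to the ISO standard."""

    def dump_csv(self, out_file):
        """Step 4 (assumed name): write the original annotation as CSV."""
        csv.writer(out_file).writerows(self.csv_corpus)


class ToyCorpus(Corpus):
    """Made-up two-utterance corpus used only to exercise the interface."""

    # Illustrative mapping, not the repository's mapping table.
    TAG_MAP = {"qy": "propositionalQuestion", "sd": "inform"}

    def load_csv(self):
        # A real mapper would parse the files found in self.corpus_folder.
        return [
            ("is it raining", "qy", "", 0, "{}", ""),
            ("it is sunny", "sd", "qy", 1, "{}", "{}"),
        ]

    def create_csv(self):
        return [
            (sent, self.TAG_MAP[tag], self.TAG_MAP.get(prev, ""), seg, info, prev_info)
            for sent, tag, prev, seg, info, prev_info in self.csv_corpus
        ]


corpus = ToyCorpus("unused-folder")
iso_rows = corpus.create_csv()
buffer = io.StringIO()
corpus.dump_csv(buffer)
```

A mapper for a new corpus would then only need to implement the two abstract methods.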

The CSV file is comma-separated and has a very simple structure. Each row is a tuple of the form

(sentence, DA tag, previous DA tag, segment, additional info, previous additional info)

where

sentence is the annotated utterance
DA tag is the corresponding DA tag
previous DA tag is the DA tag of the previous sentence
segment is an index which encodes the logical segment to which the utterance belongs (many of these corpora contain multi-utterance DA annotations)
additional info contains a JSON object with additional information required to map this corpus (for example, for Oasis it contains an additional label used to encode DAs)
previous additional info contains the additional info for the previous sentence
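As a minimal sketch of this row layout, the snippet below writes two rows and reads them back with Python's standard csv and json modules. The field names in FIELDS, the example tags and the "move" key are hypothetical; the sketch also assumes that an empty cell means "no additional info", which this README does not specify.

```python
import csv
import io
import json

# Assumed column names, in the order of the tuple described above.
FIELDS = [
    "sentence", "da_tag", "prev_da_tag",
    "segment", "additional_info", "prev_additional_info",
]


def write_rows(rows, out_file):
    """Write row tuples as comma-separated lines."""
    csv.writer(out_file).writerows(rows)


def read_rows(in_file):
    """Yield one dict per row, decoding the segment index and the JSON cells."""
    for record in csv.reader(in_file):
        row = dict(zip(FIELDS, record))
        row["segment"] = int(row["segment"])
        for key in ("additional_info", "prev_additional_info"):
            # Assumption: an empty cell means there is no additional info.
            row[key] = json.loads(row[key]) if row[key] else {}
        yield row


rows = [
    ("okay", "ack", "", 0, json.dumps({}), ""),
    ("can I have your number", "query", "ack", 1,
     json.dumps({"move": "request"}), json.dumps({})),
]

buffer = io.StringIO()
write_rows(rows, buffer)
buffer.seek(0)
parsed = list(read_rows(buffer))
```

The csv module quotes the JSON cell automatically, so the embedded double quotes survive the round trip.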

The Corpus interface can be implemented to add mappings for further dialogue corpora, which can then be used by the DialogueActTrain class, which handles training of the statistical model.

Training a model

Training a model is handled by the DialogueActTrain class. The class has a train_all method which trains a complete model for ISO DA annotation and dumps it in the output_folder folder. Features for the individual models can be enabled or disabled at will, and new features are easy to add since the code uses extensible scikit-learn Pipelines. Each classifier in the pipeline is a Support Vector Machine (LinearSVC). Models are dumped with the pickle library, which is part of the Python standard library.
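The general pattern described here (a scikit-learn Pipeline ending in LinearSVC, serialized with pickle) can be sketched as follows. This is not the repository's train_all implementation: the feature set, step names, toy sentences and labels are all assumptions chosen for illustration.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC

# Toy training data with made-up labels; real training uses the mapped corpora.
train_sentences = ["is it raining", "are you there", "it is sunny", "i am here"]
train_tags = ["question", "question", "statement", "statement"]

# Each entry is one feature extractor; removing an entry disables that feature.
features = FeatureUnion([
    ("word_ngrams", CountVectorizer(ngram_range=(1, 2))),
    ("char_ngrams", CountVectorizer(analyzer="char_wb", ngram_range=(2, 3))),
])

model = Pipeline([
    ("features", features),
    ("classifier", LinearSVC()),
])
model.fit(train_sentences, train_tags)

# pickle.dumps keeps the sketch free of file I/O; pickle.dump(model, file)
# would write the same bytes to a file in an output folder.
serialized = pickle.dumps(model)
restored = pickle.loads(serialized)
```

A pickled model should only be loaded from a trusted source and with a compatible scikit-learn version, since unpickling can execute arbitrary code.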

Using the DA tagger

Once training is complete, the DA tagger is ready to use. The DialogueActTagger class loads a trained model and exposes methods to tag a sentence with the complete taxonomy or with a single dimension only. These methods also return the confidence of each model, which gives an indication of how reliable the prediction is.
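This README does not say how the confidence value is computed. One common option for a LinearSVC model, shown here purely as an assumption, is to report the largest decision-function margin alongside the predicted label. The tag_with_confidence function and the toy data are hypothetical and are not the DialogueActTagger API.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy three-class data with made-up labels, so the margins are per-class.
sentences = [
    "is it raining", "are you there",
    "it is sunny", "i am here",
    "okay", "right okay",
]
tags = ["question", "question", "statement", "statement", "feedback", "feedback"]

model = make_pipeline(CountVectorizer(), LinearSVC())
model.fit(sentences, tags)


def tag_with_confidence(model, sentence):
    """Return (label, margin); the margin is an uncalibrated score, not a probability."""
    scores = model.decision_function([sentence])[0]
    best = int(np.argmax(scores))
    return model.classes_[best], float(scores[best])


label, confidence = tag_with_confidence(model, "is it raining")
```

Because the margin is uncalibrated, it is mainly useful for ranking predictions or for thresholding against values observed on held-out data.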
