This project forked from arritmic/dialogueact-tagger

A resource to create a multi domain Dialog Act Tagger for conversational agents using publicly available data

ISO-Standard-DA

A resource to create an ISO-compliant Dialogue Act Tagger using publicly available data.

If you use our code, please cite our COLING 2018 paper "ISO-Standard Domain-Independent Dialogue Act Tagging for Conversational Agents":

Mezza, S., Cervone, A., Stepanov, E., Tortoreto, G., & Riccardi, G. (2018, August). ISO-Standard Domain-Independent Dialogue Act Tagging for Conversational Agents. In Proceedings of the 27th International Conference on Computational Linguistics (pp. 3539-3551).

Getting started

This repository contains a set of utilities for training and testing a fully functional Dialogue Act Tagger compliant with the new ISO DA scheme. The software trains its models only on publicly available datasets, but some of these resources require authorization before use. Please check that you have obtained the data you need before running the code. Information on how to obtain each corpus is available on its official website:

  1. For the Switchboard Dialogue Act Corpus: https://web.stanford.edu/~jurafsky/swb1_dialogact_annot.tar.gz
  2. For Oasis BT: http://groups.inf.ed.ac.uk/oasis/ (please note that Oasis BT's license has expired and thus to our knowledge the corpus is no longer available)
  3. For Maptask: http://groups.inf.ed.ac.uk/maptask/
  4. For VerbMobil2: https://www.phonetik.uni-muenchen.de/Bas/BasVM2eng.html
  5. For AMI: http://groups.inf.ed.ac.uk/ami/corpus/

Each corpus folder has a data subfolder, which must contain the unzipped parent folder of the corresponding resource. The file DialogueActTrain.py contains a small main method which can be used to train a Dialogue Act Tagger. If you do not have access to one of the resources, comment out the corresponding line to exclude it from training. The trained DA tagger is stored in the models folder. You can then use the DialogueActTagger class, defined in the Python file of the same name, to test and use it. The code was written for Python 3 and requires spaCy 1.8 and the latest version of scikit-learn.

The Corpus class and the corpora mappers

The main component of the code architecture is the Corpus interface and its extensions, which convert each corpus's original annotation scheme to the ISO standard. The interface is built around four main steps:

  • Loading (usually handled in the constructor)
  • Converting to CSV (load_csv method). This is necessary because some of these resources were annotated in a rather inconvenient XML format that is hard for a human annotator to read.
  • Mapping to ISO standard (create_csv method)
  • Dumping (a separate method was implemented for each ISO dimension, plus one to just output the original corpus annotation in CSV).

The CSV is comma-separated and has a very simple structure. Each row is a tuple of the form

(sentence, DA tag, previous DA tag, segment, additional info, previous additional info)

where

  • sentence is the annotated utterance
  • DA tag is the corresponding DA tag
  • previous DA tag is the DA tag of the previous sentence
  • segment is an index which encodes the logical segment to which the utterance belongs (many of these corpora contain multi-utterance DA annotations).
  • additional info contains a JSON object with additional information required to map this corpus (for example, for Oasis it contains an additional label used to encode DAs)
  • previous additional info contains the additional info for the previous sentence
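
As a minimal sketch of how such rows could be read back, the snippet below parses two invented sample lines with Python's csv and json modules. The Row type, the parse_rows helper, the sample tags, the integer type of segment, and the use of an empty field for missing additional info are all assumptions made for illustration; the repository's own files may differ.

```python
import csv
import io
import json
from typing import NamedTuple, Optional


class Row(NamedTuple):
    """One CSV row, with field names following the tuple description above."""
    sentence: str
    da_tag: str
    prev_da_tag: str
    segment: int               # assumed to be an integer index
    info: Optional[dict]       # decoded JSON, or None when the field is empty
    prev_info: Optional[dict]


def parse_rows(handle):
    """Yield one Row per CSV line, decoding the two JSON columns."""
    for sentence, tag, prev_tag, segment, info, prev_info in csv.reader(handle):
        yield Row(
            sentence,
            tag,
            prev_tag,
            int(segment),
            json.loads(info) if info else None,
            json.loads(prev_info) if prev_info else None,
        )


# Invented sample data: the tags and the "label" key are placeholders.
sample = io.StringIO(
    'hello there,Greeting,Other,0,,\n'
    '"yes, that is right",Answer,Greeting,1,"{""label"": ""x""}",\n'
)
rows = list(parse_rows(sample))
```

Because the file is comma-separated, sentences containing commas and the embedded JSON have to be quoted, which the csv module handles on both reading and writing.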

The Corpus interface can be implemented to add mappings for additional dialogue corpora, which can then be used by the DialogueActTrain class, which trains the statistical model.
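
The sketch below illustrates what a mapper for a new corpus might look like. Only the method names load_csv and create_csv come from the description above; the constructor argument, the method signatures, the csv_corpus attribute, the ToyCorpus class, and its tag mapping are hypothetical and chosen purely for illustration.

```python
from abc import ABC, abstractmethod


class Corpus(ABC):
    """Hypothetical interface; signatures are assumed, not the repository's."""

    def __init__(self, corpus_folder):
        self.corpus_folder = corpus_folder
        # Loading is done in the constructor, as described above.
        self.csv_corpus = self.load_csv()

    @abstractmethod
    def load_csv(self):
        """Return rows in the corpus's original annotation scheme."""

    @abstractmethod
    def create_csv(self, rows):
        """Return the same rows with DA tags mapped to the ISO scheme."""


class ToyCorpus(Corpus):
    # Invented mapping from corpus-specific tags to placeholder target tags.
    TAG_MAP = {"sd": "Statement", "qy": "Question"}

    def load_csv(self):
        # A real mapper would parse the files under self.corpus_folder.
        raw = [("i think so", "sd", 0), ("do you agree", "qy", 1)]
        rows, prev_tag = [], "Other"
        for sentence, tag, segment in raw:
            rows.append((sentence, tag, prev_tag, segment, None, None))
            prev_tag = tag
        return rows

    def _convert(self, tag):
        # Unknown tags fall back to a catch-all label.
        return self.TAG_MAP.get(tag, "Other")

    def create_csv(self, rows):
        return [
            (sentence, self._convert(tag), self._convert(prev), segment, info, prev_info)
            for sentence, tag, prev, segment, info, prev_info in rows
        ]


corpus = ToyCorpus("data/")
iso_rows = corpus.create_csv(corpus.csv_corpus)
```

Keeping the original-scheme rows separate from the mapped rows mirrors the split between the conversion step and the mapping step, so the original annotation can still be dumped unchanged.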

Training a model

Training a model is handled by the DialogueActTrain class. Its train_all method trains a complete model for ISO DA annotation and dumps it in the output_folder folder. Features for the individual models can be enabled or disabled at will, and new features are easy to add because the code uses extensible scikit-learn Pipelines. Each classifier in the pipeline is a Support Vector Machine (LinearSVC). Models are dumped with the pickle library, which is part of the Python standard library.
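
The general pattern can be sketched as follows, assuming a recent scikit-learn release. This is not the repository's train_all implementation: the build_pipeline helper, the choice of n-gram features, the use_char_ngrams flag, and the toy sentences and tags are invented for illustration.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC


def build_pipeline(use_char_ngrams=True):
    """Assemble a text classifier; the character n-gram block can be switched off."""
    features = [("words", CountVectorizer(ngram_range=(1, 2)))]
    if use_char_ngrams:
        features.append(
            ("chars", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)))
        )
    return Pipeline([
        ("features", FeatureUnion(features)),
        ("classifier", LinearSVC()),
    ])


# Invented toy data; real training uses the mapped corpora.
sentences = ["what time is it", "where are you", "it is noon", "i am home"]
tags = ["Question", "Question", "Statement", "Statement"]

model = build_pipeline().fit(sentences, tags)

# Round-trip through pickle in memory; writing to a file works the same way
# with pickle.dump and pickle.load.
restored = pickle.loads(pickle.dumps(model))
```

Adding a feature then amounts to appending another named transformer to the FeatureUnion list. Pickled models should only be loaded from trusted sources, and ideally with the same scikit-learn version that produced them.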

Using the DA tagger

Once training is complete, the DA tagger is ready to use. The DialogueActTagger class loads a trained model and exposes methods to tag a sentence with the complete taxonomy or with a single dimension only. These methods also return the confidence of each model as an indication of how reliable the prediction is.
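
How the confidence is computed is not specified here. LinearSVC does not produce probabilities, so one plausible approach, shown below purely as an assumption, is to report the margin returned by decision_function for the chosen label. The tag function, the toy data, and the tag names are hypothetical and do not reflect the DialogueActTagger API.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def tag(model, sentence):
    """Return (label, score), where score is the SVM margin of the chosen label."""
    scores = np.atleast_1d(model.decision_function([sentence])[0])
    if scores.size == 1:
        # Binary case: a single signed margin, positive for the second class.
        label = model.classes_[int(scores[0] > 0)]
        return label, float(abs(scores[0]))
    best = int(np.argmax(scores))
    return model.classes_[best], float(scores[best])


# Invented toy data with three placeholder tags.
sentences = [
    "what time is it", "it is noon", "please close the door",
    "where are you", "i am home", "open the window please",
]
tags = ["Question", "Statement", "Directive",
        "Question", "Statement", "Directive"]

model = make_pipeline(CountVectorizer(), LinearSVC()).fit(sentences, tags)
label, score = tag(model, "where is it")
```

A margin is an uncalibrated score rather than a probability, so it is best used to compare predictions of the same model, for example to discard tags whose score falls below a threshold chosen on held-out data.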
