jkkummerfeld / irc-disentanglement

Dataset and model for disentangling chat on IRC

License: Other

C++ 34.54% Python 64.45% Shell 0.66% Makefile 0.21% HTML 0.13%
dataset nlp natural-language-processing neural-network dynet irc dialogue dialog disentanglement

irc-disentanglement

This repository contains data and code for disentangling conversations on IRC, as described in the following two papers:

Conversation disentanglement is the task of identifying separate conversations in a single stream of messages. For example, the image below shows two entangled conversations and an annotated graph structure (indicated by lines and colours). The example includes a message that receives multiple responses, when multiple people independently help BurgerMann, and the inverse, when the last message responds to multiple messages. We also see two of the users, delire and Seveas, simultaneously participating in two conversations.

Image of an IRC message log with conversations marked

The 2019 paper:

  1. Introduces a new dataset, with disentanglement for 77,563 messages of IRC.
  2. Introduces a new model, which achieves significantly higher results than prior work.
  3. Re-analyses prior work, identifying issues with data and assumptions in models.

The 2023 paper:

  1. Introduces a multi-domain dataset, with enough annotated data for evaluation.
  2. Studies annotation methods, showing that guidance can improve accuracy of non-expert annotation, but crowd annotation remains a challenge.

To get our code and data, download this repository in one of these ways:

The data is also available here:

This repository contains:

  • The annotated data for Ubuntu, Channel Two (2019 paper), and four new channels (2023 paper).
  • The code for our model.
  • The code for tools that do evaluation, preprocessing and data format conversion.
  • A collection of 496,469 automatically disentangled conversations from 2004 to 2019 in a bzip2 file.
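The bzip2 file in the list above can be streamed without unpacking it to disk. The sketch below is a minimal example under assumptions: the file name and the layout of its contents are not given here, so `iter_bz2_lines` is a hypothetical helper that simply yields decoded lines from whatever path you pass in.

```python
import bz2


def iter_bz2_lines(path):
    """Yield decoded lines from a bzip2 file without unpacking it to disk.

    `path` is a placeholder: the archive name is not stated in this README.
    """
    # errors="replace" keeps the loop going if a log line is not valid UTF-8.
    with bz2.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            yield line.rstrip("\n")
```

Iterating lazily like this keeps memory use flat even for a large archive.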

If you use the data or code in your work, please cite our work as:

@InProceedings{acl19disentangle,
  author    = {Jonathan K. Kummerfeld and Sai R. Gouravajhala and Joseph Peper and Vignesh Athreya and Chulaka Gunasekara and Jatin Ganhotra and Siva Sankalp Patel and Lazaros Polymenakos and Walter S. Lasecki},
  title     = {A Large-Scale Corpus for Conversation Disentanglement},
  booktitle = {Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics},
  location  = {Florence, Italy},
  month     = {July},
  year      = {2019},
  doi       = {10.18653/v1/P19-1374},
  pages     = {3846--3856},
  url       = {https://aclweb.org/anthology/papers/P/P19/P19-1374/},
  arxiv     = {https://arxiv.org/abs/1810.11118},
  software  = {https://www.jkk.name/irc-disentanglement},
  data      = {https://www.jkk.name/irc-disentanglement},
}

@InProceedings{alta23disentangle,
  author    = {Sai R. Gouravajhala and Andrew M. Vernier and Yiming Shi and Zihan Li and Mark Ackerman and Jonathan K. Kummerfeld},
  title     = {Chat Disentanglement: Data for New Domains and Methods for More Accurate Annotation},
  booktitle = {Proceedings of the 21st Annual Workshop of the Australasian Language Technology Association},
  location  = {Melbourne, Australia},
  month     = {November},
  year      = {2023},
  doi       = {},
  pages     = {},
  url       = {},
  arxiv     = {},
  data      = {https://www.jkk.name/irc-disentanglement},
}

Running and Reproducing Results

See the src folder README for detailed instructions on running the system. Additional evaluation script information can be found in the tools README.

Updates

  1. The description of the voting ensemble in the paper has a mistake. When not all models agree, the most agreed upon link is chosen (ties are broken by choosing the shorter link).
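The corrected voting rule can be sketched as follows. This is an illustration, not the repository's implementation: `vote_link` is a hypothetical name, each model is assumed to predict one antecedent index per message, and "shorter link" is assumed to mean the smaller distance between message indices.

```python
from collections import Counter


def vote_link(src, predictions):
    """Pick one antecedent for message `src` from several model predictions.

    `predictions` holds one predicted antecedent index per model (assumed).
    """
    counts = Counter(predictions)
    # Most votes first; ties go to the link with the smallest distance to src.
    return min(counts, key=lambda tgt: (-counts[tgt], abs(src - tgt)))
```

Under these assumptions, `vote_link(10, [7, 7, 9])` picks 7 because it has the most votes, while `vote_link(10, [7, 9])` is a tie and picks 9 because it is the shorter link.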

Questions

If you have a question please either:

Contributions

If you find a bug in the data or code, please submit an issue, or even better, a pull request with a fix. I will be merging fixes into a development branch and only infrequently merging all of those changes into the master branch (at which point this page will be adjusted to note that it is a new release). This approach is intended to balance the need for clear comparisons between systems, while also improving the data.

Acknowledgments

The material from the 2019 paper is based in part upon work supported by IBM under contract 4915012629. The material from the 2023 paper is based in part upon work supported by DARPA (grant #D19AP00079), and the ARC (DECRA grant). Any opinions, findings, conclusions or recommendations expressed are those of the authors and do not necessarily reflect the views of these other organisations.

irc-disentanglement's Issues

majority vote for graph inconsistent with paper?

Hi, I was trying to reproduce the results in your paper. I might be wrong, but this part of majority_vote.py (lines 54-58) seems inconsistent with what the paper describes:

        if options[0][0] >= MIN_AGREE:
            if options[0][1] != src:
                keep = [n for c, n in options if c >= MIN_AGREE and n != src]
        elif options[0][1] != src:
            keep = [options[0][1]]

In the paper you said:

... combine output by keeping the edges they all agree on. Link messages with no agreed antecedent to themselves.

In the code, it seems that you:

  1. do not keep the edges that all agree on if they are self-links, and
  2. if there's no agreed antecedent, you link it to the most-voted link IF the most-voted one is not a self-link?

As a result, I wasn't able to reproduce the results reported in your paper.
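To make the two observations above concrete, the quoted branch can be wrapped in a small function and run on toy votes. Several things here are assumptions rather than facts about the repository: `options` is taken to be a list of (count, target) pairs sorted by count with the highest first, `MIN_AGREE` is set to 3 for illustration, `keep` is assumed to start empty, and `kept_links` is a hypothetical name.

```python
MIN_AGREE = 3  # assumed threshold; the real value lives in majority_vote.py


def kept_links(src, options):
    """Apply the quoted branch to `options`.

    `options` is assumed to be (count, target) pairs, highest count first.
    """
    keep = []  # assumed default when neither branch selects a link
    if options[0][0] >= MIN_AGREE:
        if options[0][1] != src:
            keep = [n for c, n in options if c >= MIN_AGREE and n != src]
    elif options[0][1] != src:
        keep = [options[0][1]]
    return keep
```

With these assumptions, `kept_links(10, [(3, 10)])` returns an empty list (a unanimous self-link is not kept by this branch), and `kept_links(10, [(2, 7), (1, 10)])` returns `[7]` (no full agreement, so the most-voted non-self link is kept). What the surrounding code does with an empty `keep` is not shown in the quoted lines.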

Release test data and conversations

As noted in the README.md, the test data is being withheld until the end of the DSTC 8 shared task. This issue is a reminder to add it after that (i.e., in November).

Annotations to cluster for "conversation-eval"

I converted the output (auto) to graphs with output-from-py-to-graph.py and then to clusters with graph-to-cluster.py.

I am now attempting to use the conversation-eval.py file to get cluster metrics but I am running into an issue finding how to create the gold file which is input to this script. What would be the proper way to convert the annotation files to "gold clusters"?

Thanks for any help.
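As general background to this question, turning a reply graph into clusters amounts to finding connected components. The sketch below is not the repository's graph-to-cluster.py: `links_to_clusters` is a hypothetical helper, and it assumes the links are available as pairs of message indices, with a self-link marking a message that starts a conversation.

```python
def links_to_clusters(links):
    """Group message indices into conversations (connected components)."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb  # merge the two conversations

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)
    # Sort members and clusters so the output is deterministic.
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: c[0])
```

For example, the links `[(1000, 993), (1001, 1000), (1002, 1002), (1005, 1002)]` yield two clusters: `[993, 1000, 1001]` and `[1002, 1005]`.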

About Annotations

Hello! Your work is great, but I am confused about the annotations:

In the README of the data folder, you give an example from the files of the form `2007-12-17.train-a.*`, and the corresponding annotation is 993 1000 -.

But when I use Python to process the raw/ascii/tok files, the line at 992 is not '[03:41] amitprakash...'. I also checked these text files in VS Code, and I think the problem is caused by where a new line starts: in your example, the response from 'ubotu' is split into six lines, while in the raw/ascii/tok files this response is a single line.

So I would like to know how to fix this mismatch, or whether I have misunderstood something.

Thanks,
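One thing worth ruling out when line indices drift in Python is the line-splitting method. `str.splitlines()` also breaks on characters such as form feed, so it can produce more lines than there are `\n` characters in the file. Whether that is the cause here is not confirmed. The sketch below assumes annotation lines look like `993 1000 -` and reads a log so that only `\n` ends a line; `parse_annotation` and `load_messages` are hypothetical helpers, and whether the indices count from 0 or 1 is not stated in this post.

```python
def parse_annotation(line):
    """Parse one annotation line such as '993 1000 -' into two ints."""
    first, second, _ = line.split()
    return int(first), int(second)


def load_messages(path):
    """Read a log file, treating only '\\n' as a line break."""
    # newline="\n" stops Python from splitting on '\r' as well.
    with open(path, encoding="utf-8", errors="replace", newline="\n") as f:
        return [line.rstrip("\n") for line in f]
```

Indexing the list returned by `load_messages` with the two numbers from `parse_annotation` then refers to physical lines of the file, which is the assumption the annotation format appears to rely on.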
