pariajm / e2e-asr-and-disfluency-removal-evaluator

A new metric for evaluating end-to-end speech recognition and disfluency removal systems

Home Page: https://www.aclweb.org/anthology/2020.findings-emnlp.186.pdf

End-to-end Speech Recognition and Disfluency Removal

Evaluation code for joint ASR and disfluency removal based on End-to-End Speech Recognition and Disfluency Removal from EMNLP Findings 2020.

Contents

  1. Basic Overview
  2. Abstract
  3. Why is Word Error Rate (WER) Insufficient?
  4. New Metrics
  5. Requirements
  6. Align and Score
  7. Citation
  8. Contact

Basic Overview

[Figure: basic overview]

Abstract

Disfluency detection is usually an intermediate step between an automatic speech recognition (ASR) system and a downstream task (e.g. machine translation). By contrast, we investigate the task of end-to-end speech recognition and disfluency removal. We specifically explore whether it is possible to train an ASR model to directly map disfluent speech into fluent transcripts, without relying on a separate disfluency detection model. We show that end-to-end models do learn to directly generate fluent transcripts; however, their performance is slightly worse than a baseline pipeline approach consisting of an ASR system and a disfluency detection model. We also propose two new metrics that can be used for evaluating integrated ASR and disfluency models.

Why is Word Error Rate (WER) Insufficient?

Word Error Rate (WER) is a standard metric for evaluating ASR models; however, it is insufficient for evaluating the outputs of end-to-end ASR and disfluency removal systems because it fails to measure the fluency of the generated transcript. Consider the following example, where "Ref" is the reference transcript (human-transcribed speech with gold disfluency labels) and "E2E" is the output of an end-to-end ASR and disfluency removal model (i.e. the hypothesis).
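The paper's figure is not reproduced here; an illustrative pair (not taken from the paper's data, with gold disfluent words marked in uppercase as elsewhere in this repo) might look like:

Ref: i want a flight TO BOSTON UM I MEAN to denver
E2E: i want a flight to boston um i mean to denver    (WER = 0%, but no disfluency was removed)
E2E: i want a flight to denver                        (WER = 5/11 ≈ 45%, yet this is the desired fluent output)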

The reference transcript contains both fluent and disfluent words, so a WER of zero on the full transcript means that the system returned all of the disfluent words as well as the fluent ones, which is not what an integrated system should do: it is expected to recognize all fluent words and discard all disfluencies. This repo contains the code for calculating two new metrics, the Fluent Error Rate (FER) and the Disfluent Error Rate (DER), which assess ASR outputs in terms of both fluency and word recognition performance.

New Metrics

To overcome the limitations of WER, we apply the standard WER evaluation to fluent and disfluent words separately, so the quality of end-to-end model outputs is evaluated in terms of both fluency and word recognition. We calculate the word error rate on fluent words (the fluent error rate, or FER) as the number of substitutions, deletions and insertions among fluent words divided by the total number of fluent words in the reference transcript. We define the word error rate on disfluent words (the disfluent error rate, or DER) as anything other than a deletion (i.e. substitutions, insertions and copies) among disfluent words divided by the total number of disfluent words in the reference transcript.

We calculate FER and DER using a modified ASR alignment algorithm that applies two slightly different sets of costs to fluent and disfluent regions. Since many disfluencies are copies (e.g. "The the the student is here"), using the same costs for fluent and disfluent words makes the alignment ambiguous (i.e. multiple alignments have the same cost). We instead expect the aligner to match ASR words to the fluent words in the reference and to align null (i.e. deletions) with the disfluent ones, so a standard ASR alignment algorithm can lead to undesirable alignments for end-to-end ASR and disfluency removal models. As a solution, we use two slightly different sets of costs for aligning fluent and disfluent regions so that the aligner is biased away from selecting disfluent words. The new evaluation metrics and alignment weights are useful for aligning and evaluating any end-to-end system trained to remove disfluencies from its output, e.g. end-to-end speech translation and disfluency removal.
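Restating the definitions above in symbols, where S, D, I and C count substitutions, deletions, insertions and copies, the subscripts restrict them to fluent or disfluent reference words, and N is the corresponding number of reference words:

FER = (S_fluent + D_fluent + I_fluent) / N_fluent
DER = (S_disfluent + I_disfluent + C_disfluent) / N_disfluent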

Operation      Fluent cost   Disfluent cost
copy           0             0 + 1e-7
deletion       3             3 - 1e-7
insertion      3             3 + 1e-7
substitution   4             4 + 1e-7

Consider the following example. "Align 1" represents the alignment between the output of an integrated ASR and disfluency model and the reference transcript produced by the standard alignment costs, where equal costs are used for fluent and disfluent words (copy = 0, deletion = 3, insertion = 3 and substitution = 4). "Align 2" is the desired alignment for meaningful FER and DER evaluation, which is achieved by using the two slightly different sets of costs above for aligning fluent and disfluent words.
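To make the cost table concrete, here is a minimal sketch of such a modified alignment. It is not the repository's main.py and simplifies some details (in particular, insertions are charged to the fluent score regardless of where they occur): disfluent reference words are marked by uppercase, the perturbed costs bias the aligner toward deleting them, and FER/DER are then read off the alignment.

# Minimal sketch (assumptions: disfluent reference words are uppercase,
# hypothesis words are lowercase, and the reference contains at least one
# fluent and one disfluent word).
def modified_align(ref, hyp, eps=1e-7):
    """Align hyp to ref, using slightly different costs for disfluent ref words."""
    def costs(ref_word):
        if ref_word.isupper():                 # disfluent reference word
            return 0 + eps, 3 - eps, 4 + eps   # copy, deletion, substitution
        return 0.0, 3.0, 4.0                   # fluent reference word

    INF = float("inf")
    n, m = len(ref), len(hyp)
    dp = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if dp[i][j] == INF:
                continue
            if i < n:
                copy_c, del_c, sub_c = costs(ref[i])
                if dp[i][j] + del_c < dp[i + 1][j]:          # delete ref[i]
                    dp[i + 1][j], back[i + 1][j] = dp[i][j] + del_c, (i, j, "del")
                if j < m:
                    c, op = (copy_c, "copy") if ref[i].lower() == hyp[j].lower() else (sub_c, "sub")
                    if dp[i][j] + c < dp[i + 1][j + 1]:      # copy or substitute
                        dp[i + 1][j + 1], back[i + 1][j + 1] = dp[i][j] + c, (i, j, op)
            if j < m and dp[i][j] + 3.0 < dp[i][j + 1]:      # insert hyp[j]
                dp[i][j + 1], back[i][j + 1] = dp[i][j] + 3.0, (i, j, "ins")
    ops, i, j = [], n, m                                     # backtrace
    while (i, j) != (0, 0):
        pi, pj, op = back[i][j]
        ops.append((ref[pi] if op != "ins" else None, op))
        i, j = pi, pj
    return list(reversed(ops))

def fer_der(ops):
    """FER: errors on fluent ref words / fluent ref words.
       DER: non-deletions on disfluent ref words / disfluent ref words."""
    fl_err = fl_tot = dis_err = dis_tot = 0
    for ref_word, op in ops:
        if ref_word is None:           # insertion, charged to the fluent score here
            fl_err += 1
        elif ref_word.isupper():       # disfluent: keeping it (copy/sub) is an error
            dis_tot += 1
            if op != "del":
                dis_err += 1
        else:                          # fluent: anything but an exact copy is an error
            fl_tot += 1
            if op != "copy":
                fl_err += 1
    return fl_err / fl_tot, dis_err / dis_tot

ref = "i want a flight TO BOSTON UM I MEAN to denver".split()
hyp = "i want a flight to denver".split()
print(fer_der(modified_align(ref, hyp)))   # -> (0.0, 0.0)

In this toy example, the hypothesis word "to" could align equally well to the disfluent "TO" or the fluent "to" under uniform costs; the 1e-7 perturbation makes the fluent match strictly cheaper, which is exactly the bias described above.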

Requirements

  • Python 3.6 or higher (it should also work with Python 2)

Align and Score

The files passed to --ref and --hyp should contain one sentence per line. All words in the reference and hypothesis files should be lowercase, except for GOLD DISFLUENT WORDS in the reference, which are written in uppercase.
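For example (an illustrative pair, not taken from the repo's data), corresponding lines of the reference and hypothesis files might look like:

reference:  i want a flight TO BOSTON UM I MEAN to denver
hypothesis: i want a flight to denver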

To align the reference and hypothesis transcripts and calculate WER using a standard ASR alignment algorithm:

$ python3 main.py --ref /path/to/reference/file --hyp /path/to/hypothesis/file --mode align --result-path /path/to/save/alignments/and/scores

To align the reference and hypothesis transcripts and calculate FER and DER using the modified ASR alignment algorithm:

$ python3 main.py --ref /path/to/reference/file --hyp /path/to/hypothesis/file --mode mod_align --result-path /path/to/save/alignments/and/scores

To test the modified algorithm using the pre-defined reference and hypothesis sentences:

$ python3 main.py --mode test

Citation

If you use the code, please cite the following paper:

@inproceedings{jamshid-lou-johnson-2020-end,
    title = "End-to-End Speech Recognition and Disfluency Removal",
    author = "Jamshid Lou, Paria  and Johnson, Mark",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.findings-emnlp.186",
    doi = "10.18653/v1/2020.findings-emnlp.186",
    pages = "2051--2061"
}

Contact

Paria Jamshid Lou [email protected]
