long-context-asr's Introduction

Code for the paper: How Much Context Does My Attention-Based ASR System Need?

  • Pre-Print Available on arXiv
  • This repository is continually being updated with more instructions. Eventually I hope to provide a Colab notebook that runs the evaluations in the paper using the pretrained checkpoints provided.
  • As the repo is a work in progress, if you cannot figure out how to use anything, please feel free to contact me by creating an issue!

Installation

For language model decoding, the following repo must also be installed: https://github.com/robflynnyh/language_modelling. Instructions on how to properly install this repo and the required libraries will be provided as soon as possible.

  • Requires PyTorch 2.0 or greater
  • Currently uses flash-attention 1 (an update to v2 is planned)
  • Apex is used for fused RMS/layer norm (and fused Adam if not using MADGRAD). TODO: set up the code to work without flash-attention and fused layers installed, for easier usage
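
The requirements above can be checked before running anything. The following is a minimal sketch, not part of this repo: the helper names `parse_version` and `check_environment` are hypothetical, and it assumes that flash-attention and Apex are importable as `flash_attn` and `apex` respectively.

```python
import importlib.util
from importlib import metadata


def parse_version(version: str) -> tuple:
    """Turn a version string such as '2.1.0+cu118' into (major, minor)."""
    core = version.split("+")[0]          # drop local build tags like '+cu118'
    parts = core.split(".")[:2]           # keep major and minor only
    # Keep digits only so suffixes such as '0a0' do not break int().
    return tuple(int("".join(ch for ch in p if ch.isdigit()) or 0) for p in parts)


def check_environment() -> dict:
    """Report which of the listed requirements appear to be installed."""
    report = {}
    try:
        report["torch>=2.0"] = parse_version(metadata.version("torch")) >= (2, 0)
    except metadata.PackageNotFoundError:
        report["torch>=2.0"] = False
    # Assumed import names for flash-attention and NVIDIA Apex.
    for module in ("flash_attn", "apex"):
        report[module] = importlib.util.find_spec(module) is not None
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'found' if ok else 'missing'}")
```

The check only inspects installed package metadata and import paths, so it does not need a GPU and does not import the fused kernels themselves.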

Data

  • To train models, you must request access to the Spotify training data, which can be done via the following: link
  • Evaluation dev/test splits for Earnings-22 and Tedlium can be found in /data
  • Alternatively, the full datasets can be found via the following links: Earnings-22 Tedlium

Checkpoints

Config files for all pretrained models are provided within each checkpoint file.

Acoustic Model

  • Only greedy WERs are reported here; results when decoding using shallow fusion with a transformer LM will be added to the Language Model section.
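
For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between reference and hypothesis, divided by the number of reference words. The sketch below is a generic illustration, not the scoring code used for the numbers in this README; `word_error_rate` is a hypothetical helper, and it assumes whitespace tokenisation with no text normalisation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One missing word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

The tables report WER as a percentage, so the value returned here would be multiplied by 100. Corpus-level WER is usually computed by summing errors and reference words over all utterances rather than averaging per-utterance rates.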

Below is the best performing model that I have trained so far; this entry will be continually updated. It was trained for more epochs than the models in the paper, and SpecAugment is used.

| Context | Epochs | Seq Warmup | SpecAugment | Tedlium (WER) | Earnings-22 (WER) | Download |
|---|---|---|---|---|---|---|
| 160s | 9 | Yes | Yes | 5.3 | 14.3 | here |

Below are model checkpoints for the acoustic models discussed in the paper. The greedy WERs (no LM) are also provided, obtained using overlapping inference (87.5% overlap). For checkpoints with multiple repeats, the average WERs are provided. Models can be loaded from pretrained checkpoints using the load_pretrained.py script.

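| Context | Epochs | Seq Warmup | Tedlium (WER) | Earnings-22 (WER) | Download |
|---|---|---|---|---|---|
| 80s | 2 | Yes | 6.1 | 17.1 | here |
| 1 hour | 1 | Yes | 6.4 | 18.8 | here |
| 320s | 1 | Yes | 6.5 | 18.6 | here |
| 160s | 1 | Yes | 6.5 | 18.7 | here |
| 80s | 1 | Yes | 6.5 | 18.7 | here |
| 40s | 1 | Yes | 6.7 | 19.2 | here |
| 40s | 1 | No | 6.5 | 19.4 | here |
| 20s | 1 | No | 6.6 | 19.4 | here |
| 10s | 1 | No | 6.8 | 20.5 | here |
| 5s | 1 | No | 7.4 | 21.9 | here |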
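
To make the 87.5% overlap figure concrete: with that overlap, each context window advances by 12.5% of its length. The sketch below only shows how a long recording could be split into such windows. `overlapping_windows` is a hypothetical helper, the unit (seconds or frames) is left to the caller, and how this repo merges predictions from overlapping windows is not covered here.

```python
def overlapping_windows(total: int, window: int, overlap: float = 0.875) -> list:
    """Return (start, end) spans covering `total` units with the given overlap."""
    if total <= 0 or window <= 0:
        raise ValueError("total and window must be positive")
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")

    # 87.5% overlap means the window advances by 12.5% of its length.
    stride = max(1, round(window * (1.0 - overlap)))

    spans = []
    start = 0
    while True:
        end = min(start + window, total)  # clip the final window to the input
        spans.append((start, end))
        if end >= total:
            break
        start += stride
    return spans


if __name__ == "__main__":
    # A 100-unit recording with an 80-unit context: stride is 10 units.
    for span in overlapping_windows(100, 80):
        print(span)
```

With an 80s context, for example, the stride is 10s, so most of the audio is seen by several windows with different surrounding context.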

Language Model

  • Language model checkpoint to be added soon! Below are the results for the best performing model when decoding with the transformer LM (decoding config: \alpha 0.45; \beta 1.53; p cutoff 3.17; top_am_threshold -6; beam width 25; LM context 1024 tokens)

| Tedlium (WER) | Earnings-22 (WER) |
|---|---|
| 4.2 | 11.9 |
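
As a rough illustration of how the \alpha and \beta values in the decoding config are typically used, the sketch below scores beam hypotheses with the common shallow fusion formula: acoustic log-probability plus \alpha times LM log-probability plus \beta times hypothesis length. This is an assumption about the convention, not a description of this repo's decoder. `Hypothesis`, `fused_score` and `prune` are hypothetical names, and the p cutoff and top_am_threshold settings are not modelled.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    tokens: list      # decoded tokens so far
    am_logp: float    # acoustic model log-probability
    lm_logp: float    # language model log-probability


def fused_score(hyp: Hypothesis, alpha: float = 0.45, beta: float = 1.53) -> float:
    """Shallow fusion score: AM score + alpha * LM score + beta * length bonus."""
    return hyp.am_logp + alpha * hyp.lm_logp + beta * len(hyp.tokens)


def prune(beam: list, beam_width: int = 25,
          alpha: float = 0.45, beta: float = 1.53) -> list:
    """Keep the `beam_width` highest-scoring hypotheses."""
    ranked = sorted(beam, key=lambda h: fused_score(h, alpha, beta), reverse=True)
    return ranked[:beam_width]


if __name__ == "__main__":
    beam = [
        Hypothesis(["hello"], am_logp=-0.5, lm_logp=-1.0),
        Hypothesis(["hello", "world"], am_logp=-1.0, lm_logp=-2.0),
    ]
    for hyp in prune(beam, beam_width=2):
        print(hyp.tokens, fused_score(hyp))
```

The length term offsets the tendency of summed log-probabilities to favour shorter hypotheses, which is why \beta is usually positive.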

Comparison to Whisper

  • For comparison with Whisper, here is an evaluation conducted using Whisper's long-form evaluation setting (greedy decoding, i.e. the default settings of the whisper library). As shown, our system can be competitive with Whisper in some settings/model sizes, although using beam search with Whisper would improve its results.
  • We use the test/dev splits specified in ESB for Earnings-22, whereas the Whisper paper uses the entire dataset, so the Earnings-22 results here differ from those in the paper.

| Model | Tedlium (WER) | Earnings-22 (WER) |
|---|---|---|
| Base.en | 4.6* | 12.4 |
| Small.en | 4.6* | 10.2 |

*taken from paper (Table 16)

All Results

A messy dump of experimental results, including WERs for each repeat and at higher precision than presented in the paper, can be found here

TODO's

  • Finish install instructions
  • Add ability to use without installing fused kernels
  • Add language model checkpoints and instructions
