- Pre-Print Available on arXiv
- The repository is continually being updated with more instructions; eventually I hope to have a Colab that can be used to run the evaluations in the paper using the pretrained checkpoints provided.
- As the repo is a work in progress, if you cannot figure out how to use anything please feel free to contact me by creating an issue!
- For language model decoding the following repo must also be installed: https://github.com/robflynnyh/language_modelling. Instructions on how to properly install this repo and the required libraries will be provided as soon as possible.
- Requires PyTorch 2.0 or greater
- Currently flash-attention 1 is used (an update to v2 is planned)
- Apex is used for fused RMS/layer norm (and fused Adam if not using MADGRAD). TODO: set up the code to work without flash-attention and the fused layers installed, for easier usage
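Until the fused-kernel-free path lands, the following plain-Python sketch shows what the fused RMS norm computes (illustrative only; the repo uses Apex's CUDA kernels, and the `gain` parameter name here is hypothetical):

```python
import math

def rms_norm(x, gain, eps=1e-8):
    """Unfused reference RMSNorm: scale x by 1/rms(x), then by a learned gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * g for v, g in zip(x, gain)]

print(rms_norm([1.0, 2.0, 3.0], [1.0, 1.0, 1.0]))
```

Apex fuses this into a single kernel; the maths is identical.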
- For training models you must request access to the Spotify training data, which can be done via the following: link
- Evaluation dev/test splits for Earnings-22 and Tedlium can be found in /data
- Alternatively, the full datasets can be found via the following links: Earnings-22, Tedlium
Config files for all pretrained models are provided within each checkpoint file
- Only greedy WERs are reported here; results when decoding using shallow fusion with a transformer LM will be added to the language modelling section.
Below is the best-performing model that I have trained so far; this entry will be continually updated. It is trained for more epochs than the models in the paper, and SpecAugment is used.
Context | Epochs | Seq Warmup | SpecAugment | Tedlium (WER) | Earnings-22 (WER) | Download |
---|---|---|---|---|---|---|
160s | 9 | Yes | Yes | 5.3 | 14.3 | here |
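For clarity, the WER columns in these tables are standard word error rates: word-level edit distance (substitutions + insertions + deletions) divided by the reference length. A minimal sketch of that computation, not the evaluation code used by the repo:

```python
def wer(reference, hypothesis):
    """Word error rate via Levenshtein distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[-1][-1] / len(ref)

# one substitution + one deletion over a 6-word reference
print(round(wer("the cat sat on the mat", "the cat sit on mat"), 3))  # → 0.333
```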
Below are model checkpoints for the acoustic models discussed in the paper. Greedy WERs (no LM) are provided using overlapping inference (87.5% overlap). For checkpoints with multiple repeats, the average WERs are given. Models can be loaded from pretrained checkpoints using the load_pretrained.py script.
Context | Epochs | Seq Warmup | Tedlium (WER) | Earnings-22 (WER) | Download |
---|---|---|---|---|---|
80s | 2 | Yes | 6.1 | 17.1 | here |
1 hour | 1 | Yes | 6.4 | 18.8 | here |
320s | 1 | Yes | 6.5 | 18.6 | here |
160s | 1 | Yes | 6.5 | 18.7 | here |
80s | 1 | Yes | 6.5 | 18.7 | here |
40s | 1 | Yes | 6.7 | 19.2 | here |
40s | 1 | No | 6.5 | 19.4 | here |
20s | 1 | No | 6.6 | 19.4 | here |
10s | 1 | No | 6.8 | 20.5 | here |
5s | 1 | No | 7.4 | 21.9 | here |
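The 87.5% overlap above means each inference window advances by one-eighth of its length (e.g. a 160 s window moves in 20 s steps). A rough sketch of how such window start times could be laid out; the actual chunking logic in the repo may differ:

```python
def chunk_starts(audio_len_s, window_s, overlap=0.875):
    """Start times (seconds) of overlapping inference windows.
    With 87.5% overlap the stride is 12.5% of the window length."""
    stride = window_s * (1.0 - overlap)
    starts, t = [], 0.0
    while t + window_s < audio_len_s:
        starts.append(t)
        t += stride
    # final window is placed flush with the end of the audio
    starts.append(max(audio_len_s - window_s, 0.0))
    return starts

# 160 s windows over 10 minutes of audio advance in 20 s steps
print(chunk_starts(600.0, 160.0)[:3])  # → [0.0, 20.0, 40.0]
```

Predictions from the overlapping windows are then merged to produce the final transcript.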
- Language model checkpoint to be added soon! Below are the results for the best-performing model when decoding using the transformer LM (decoding config: \alpha 0.45; \beta 1.53; p cutoff 3.17; top_am_threshold -6; beam width 25; LM context 1024 tokens)
Tedlium (WER) | Earnings-22 (WER) |
---|---|
4.2 | 11.9 |
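For context, shallow fusion combines the acoustic model's log-probability with the LM's, weighted by \alpha, plus a length bonus weighted by \beta. A rough sketch of the per-hypothesis beam score under that assumption (the exact formulation, and the p cutoff / top_am_threshold pruning, are not shown here):

```python
def fusion_score(am_logp, lm_logp, n_tokens, alpha=0.45, beta=1.53):
    """Shallow-fusion beam score: AM log-prob + alpha * LM log-prob
    + beta * length bonus (hypothesis length in tokens)."""
    return am_logp + alpha * lm_logp + beta * n_tokens

# a hypothesis the LM strongly prefers can overtake one the AM alone ranks higher
a = fusion_score(am_logp=-4.0, lm_logp=-2.0, n_tokens=3)
b = fusion_score(am_logp=-3.5, lm_logp=-6.0, n_tokens=3)
print(a > b)  # → True
```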
- For comparison to the Whisper model, here is an evaluation conducted using Whisper's long-form evaluation setting (greedy decoding, i.e. default settings using the whisper library). As shown, our system can be competitive with Whisper in some settings/model sizes, although using beam search with Whisper would improve its results.
- We use the test/dev splits specified in ESB for Earnings-22, whereas the entire dataset is used in the Whisper paper; hence the results here differ from those in the paper
Model | Tedlium (WER) | Earnings-22 (WER) |
---|---|---|
Base.en | 4.6* | 12.4 |
Small.en | 4.6* | 10.2 |
*taken from paper (Table 16)
A messy dump of experimental results, including WERs for each repeat and to a higher precision than presented in the paper, can be found here
- Finish installation instructions
- Add the ability to use the code without installing the fused kernels
- Add language model checkpoints and instructions