# Audiomer: A Convolutional Transformer for Keyword Spotting
*Accepted at the AAAI 2022 DSTC Workshop*
[ arXiv ] | [ Previous SOTA ] | [ Model Architecture ]
## Pretrained Models
- Links: Google Drive
> **Note:** The pretrained models only work with commit `6270ca27de47fbfd0379c172bbc74e6a61f72176`, after which there have been breaking changes.
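One way to pin a local clone to that commit (assuming you have already cloned the repository) is:

```shell
# From inside a clone of this repository, check out the commit
# the pretrained checkpoints were trained against:
git checkout 6270ca27de47fbfd0379c172bbc74e6a61f72176
```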
## Usage
To reproduce the results in the paper, follow these instructions:

- To download the Speech Commands v2 dataset, run:
  ```shell
  python3 download_speechcommands.py
  ```
- To train Audiomer-S and Audiomer-L on all three datasets thrice, run:
  ```shell
  python3 run_expts.py
  ```
- To evaluate a model on a dataset, run:
  ```shell
  python3 evaluate.py --checkpoint_path /path/to/checkpoint.ckpt --model <model type> --dataset <name of dataset>
  ```
  For example:
  ```shell
  python3 evaluate.py --checkpoint_path ./epoch=300.ckpt --model S --dataset SC20
  ```
## Results
![](assets/results.png)
## Performer Conv-Attention
TL;DR: We augment 1D ResNets with Performer attention over the raw audio waveform.
![](assets/ConvAttention.png)
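The key idea behind Performer attention is replacing quadratic softmax attention with a linear-time kernel approximation (FAVOR+), which makes attention over long raw-waveform sequences tractable. Below is a minimal single-head NumPy sketch of that approximation; the function name, feature count, and shapes are illustrative and not taken from the Audiomer codebase (which uses the `performer_pytorch` library):

```python
import numpy as np

def performer_attention(Q, K, V, n_features=256, seed=0):
    """Linear-time approximation of softmax attention using positive
    random features (FAVOR+-style). Q, K: (L, d); V: (L, d_v)."""
    d = Q.shape[-1]
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((d, n_features))

    def phi(X):
        # Positive random features: exp(x @ w - ||x||^2 / 2) / sqrt(m)
        return np.exp(X @ W - 0.5 * (X ** 2).sum(-1, keepdims=True)) / np.sqrt(n_features)

    # Scale queries/keys as in standard attention, softmax(Q K^T / sqrt(d))
    Qp, Kp = phi(Q / d ** 0.25), phi(K / d ** 0.25)
    # Associativity: compute Kp^T V first, giving O(L * m * d)
    # cost instead of the O(L^2 * d) of exact attention
    out = Qp @ (Kp.T @ V)
    norm = Qp @ Kp.sum(axis=0)
    return out / norm[:, None]
```

Because the random features are positive, each output row is a convex combination of the rows of `V`, matching the range of exact softmax attention while scaling linearly in sequence length.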
## System requirements
- NVIDIA GPU with CUDA
- Python 3.6 or higher
- `pytorch_lightning`
- `torchaudio`
- `performer_pytorch`