Keyword Transformer: A Self-attention Model For Keyword Spotting
2021 · Axel Berg, Mark O'Connor, Miguel Tairum Cruz
Abstract
The Transformer architecture has been successful across many domains, including natural language processing, computer vision and speech recognition. In keyword spotting, self-attention has primarily been used on top of convolutional or recurrent encoders. We investigate a range of ways to adapt the Transformer architecture to keyword spotting and introduce the Keyword Transformer (KWT), a fully self-attentional architecture that exceeds state-of-the-art performance across multiple tasks without any pre-training or additional data. Surprisingly, this simple architecture outperforms more complex models that mix convolutional, recurrent and attentive layers. KWT can be used as a drop-in replacement for these models, setting two new benchmark records on the Google Speech Commands dataset with 98.6% and 97.7% accuracy on the 12-command and 35-command tasks, respectively.
Related papers
- Exploring Sequence-to-sequence Transformer-transducer Models For Keyword Spotting (2022)
- Visual Keyword Spotting With Attention (2021)
- Audiomer: A Convolutional Transformer For Keyword Spotting (2021)
- Separable Temporal Convolution Plus Temporally Pooled Attention For Lightweight High-performance Keyword Spotting (2021)
- Efficient Keyword Spotting By Capturing Long-range Interactions With Temporal Lambda Networks (2021)
- Self-supervised Speech Representation Learning For Keyword-spotting With Light-weight Transformers (2023)
- A Separable Temporal Convolution Neural Network With Attention For Small-footprint Keyword Spotting (2021)
- Small-footprint Keyword Spotting With Multi-scale Temporal Convolution (2020)