Efficient Keyword Spotting By Capturing Long-range Interactions With Temporal Lambda Networks
2021 · Biel Tura, Santiago Escuder, Ferran Diego, et al.
Abstract
Models based on attention mechanisms have shown unprecedented speech recognition performance. However, they are computationally expensive and unnecessarily complex for keyword spotting, a task targeted at small-footprint devices. This work explores the application of Lambda networks, an alternative framework for capturing long-range interactions without attention, to the keyword spotting task. We propose a novel ResNet-based model in which the residual blocks are replaced with temporal Lambda layers. Furthermore, the proposed architecture is built upon one-dimensional temporal convolutions that further reduce its complexity. The presented model not only reaches state-of-the-art accuracy on the Google Speech Commands dataset, but is also 85% and 65% lighter than its Transformer-based (KWT) and convolutional (Res15) counterparts, respectively, while being up to 100 times faster. To the best of our knowledge, this is the first attempt to explore the Lambda framework within the speech domain and there
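To make the idea of a temporal Lambda layer more concrete, the following is a minimal NumPy sketch of a single-head lambda layer applied along the time axis. It follows the general Lambda-layer formulation (a content lambda shared across time steps plus a local position lambda per time step) and is not the authors' implementation. The function name `temporal_lambda_layer`, the weight names, the window size `R`, and all dimensions are illustrative assumptions.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def temporal_lambda_layer(x, w_q, w_k, w_v, pos_emb):
    """Single-head lambda layer over a 1-D (temporal) sequence.

    Hypothetical shapes, chosen for illustration only:
      x:       (T, d)  input features, T time steps
      w_q:     (d, k)  query projection
      w_k:     (d, k)  key projection
      w_v:     (d, v)  value projection
      pos_emb: (R, k)  relative position embeddings, R odd (local window)
    Returns an array of shape (T, v).
    """
    T = x.shape[0]
    R = pos_emb.shape[0]

    q = x @ w_q                        # (T, k) queries
    keys = softmax(x @ w_k, axis=0)    # (T, k) keys normalised over time
    v = x @ w_v                        # (T, v) values

    # Content lambda: one (k, v) linear map summarising the whole sequence.
    content_lambda = keys.T @ v

    # Position lambdas: one (k, v) map per time step, built from a local
    # window of values. This is equivalent to a 1-D convolution over time.
    pad = R // 2
    v_pad = np.pad(v, ((pad, pad), (0, 0)))
    windows = np.stack([v_pad[t:t + R] for t in range(T)])   # (T, R, v)
    position_lambdas = np.einsum("rk,trv->tkv", pos_emb, windows)

    # Apply each time step's lambda to its query.
    lambdas = content_lambda[None, :, :] + position_lambdas  # (T, k, v)
    return np.einsum("tk,tkv->tv", q, lambdas)


# Usage with made-up dimensions (not taken from the paper).
rng = np.random.default_rng(0)
T, d, k, v, R = 98, 40, 16, 40, 7
x = rng.standard_normal((T, d))
w_q = rng.standard_normal((d, k)) * 0.1
w_k = rng.standard_normal((d, k)) * 0.1
w_v = rng.standard_normal((d, v)) * 0.1
pos_emb = rng.standard_normal((R, k)) * 0.1
y = temporal_lambda_layer(x, w_q, w_k, w_v, pos_emb)
print(y.shape)  # (98, 40)
```

In this sketch no T-by-T attention map is ever formed: the sequence is summarised into small (k, v) matrices that are then applied to the queries, which is the property the abstract refers to as capturing long-range interactions without attention.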
Related papers
- Efficient Keyword Spotting Using Dilated Convolutions And Gating (2018)
- Separable Temporal Convolution Plus Temporally Pooled Attention For Lightweight High-performance Keyword Spotting (2021)
- Small-footprint Keyword Spotting With Graph Convolutional Network (2019)
- A Separable Temporal Convolution Neural Network With Attention For Small-footprint Keyword Spotting (2021)
- Small-footprint Open-vocabulary Keyword Spotting With Quantized LSTM Networks (2020)
- Temporal Convolution For Real-time Keyword Spotting On Mobile Devices (2019)
- Deep Residual Learning For Small-footprint Keyword Spotting (2017)
- Small-footprint Keyword Spotting With Multi-scale Temporal Convolution (2020)