Audiomer: A Convolutional Transformer For Keyword Spotting
2021 · Surya Kant Sahu, Sai Mitheran, Juhi Kamdar, et al.
Abstract
Transformers have seen an unprecedented rise in Natural Language Processing and Computer Vision tasks. In audio tasks, however, they are either infeasible to train because of the extremely large sequence length of audio waveforms, or they incur a performance penalty when trained on Fourier-based features. In this work, we introduce an architecture, Audiomer, which combines 1D Residual Networks with Performer Attention to achieve state-of-the-art performance in keyword spotting on raw audio waveforms, outperforming all previous methods while being computationally cheaper and more parameter-efficient. Additionally, our model has practical advantages for speech processing, such as inference on arbitrarily long audio clips, owing to the absence of positional encoding. The code is available at https://github.com/The-Learning-Machines/Audiomer-PyTorch.
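To make the efficiency argument concrete, here is a minimal sketch of kernelized (linear-complexity) attention, the family that Performer attention belongs to. It is not the Audiomer implementation. As an assumption made for brevity, it swaps Performer's FAVOR+ random features for a simple `elu(x) + 1` feature map, and every function name below is hypothetical. It shows that accumulating the key/value statistics once gives the same result as building the full N x N weight matrix, and that nothing in the computation depends on a fixed sequence length or on positional encoding.

```python
import math
import random


def feature_map(x):
    # elu(x) + 1, a positive feature map. ASSUMPTION: Performer's FAVOR+
    # uses random features instead; this stand-in keeps the sketch short.
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def quadratic_attention(Q, K, V):
    """Kernel attention with explicit N x N weights: O(N^2) in sequence length."""
    phi_K = [feature_map(k) for k in K]
    d_v = len(V[0])
    out = []
    for q in Q:
        phi_q = feature_map(q)
        # One similarity weight per key for this query.
        weights = [sum(a * b for a, b in zip(phi_q, pk)) for pk in phi_K]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(d_v)])
    return out


def linear_attention(Q, K, V):
    """Same result in O(N): accumulate phi(K)^T V and sum(phi(K)) once."""
    d_k, d_v = len(K[0]), len(V[0])
    kv = [[0.0] * d_v for _ in range(d_k)]   # running phi(K)^T V
    k_sum = [0.0] * d_k                      # running sum of phi(K)
    for k, v in zip(K, V):
        pk = feature_map(k)
        for i in range(d_k):
            k_sum[i] += pk[i]
            for j in range(d_v):
                kv[i][j] += pk[i] * v[j]
    out = []
    for q in Q:
        pq = feature_map(q)
        norm = sum(a * b for a, b in zip(pq, k_sum))
        out.append([sum(pq[i] * kv[i][j] for i in range(d_k)) / norm
                    for j in range(d_v)])
    return out


def random_sequence(n, d, rng):
    # Hypothetical helper: n random d-dimensional vectors.
    return [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)
    for n in (8, 50):  # no positional encoding, so any length works
        Q, K, V = (random_sequence(n, 4, rng) for _ in range(3))
        a = quadratic_attention(Q, K, V)
        b = linear_attention(Q, K, V)
        max_diff = max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
        print(n, max_diff)
```

The two functions agree up to floating-point rounding. The linear form never materializes the N x N matrix, which is what makes attention over very long sequences such as raw waveforms tractable.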