Self-supervised Speech Representation Learning For Keyword-spotting With Light-weight Transformers
2023 · Chenyang Gao, Yue Gu, Francesco Caliva, et al.
Abstract
Self-supervised speech representation learning (S3RL) is revolutionizing the way we leverage the ever-growing availability of data. While S3RL-related studies typically use large models, we employ light-weight networks to comply with the tight memory budgets of compute-constrained devices. We demonstrate the effectiveness of S3RL on a keyword-spotting (KS) problem by using transformers with 330k parameters, and we propose a mechanism to enhance utterance-wise distinction, which proves crucial for improving performance on classification tasks. On the Google Speech Commands v2 dataset, the proposed method applied to Auto-Regressive Predictive Coding (APC) S3RL led to a 1.2% accuracy improvement compared to training from scratch. On an in-house KS dataset with four different keywords, it provided a 6% to 23.7% relative false accept improvement at a fixed false reject rate. We argue this demonstrates the applicability of S3RL approaches to light-weight models for KS and confirms S3RL is a powerful alternative t
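The abstract reports its in-house results as a relative false accept improvement at a fixed false reject rate. The sketch below shows one common way such a number can be computed. It is not the authors' evaluation code: the function names, the thresholding rule, and the toy scores are assumptions made for illustration only.

```python
import numpy as np


def false_accepts_at_frr(pos_scores, neg_scores, target_frr):
    """Count false accepts at a threshold whose false reject rate
    does not exceed target_frr (hypothetical helper, not from the paper)."""
    pos = np.sort(np.asarray(pos_scores, dtype=float))
    neg = np.asarray(neg_scores, dtype=float)
    # Number of keyword utterances we are allowed to reject.
    k = int(np.floor(target_frr * len(pos)))
    # Accept when score >= threshold; only scores below pos[k] are rejected,
    # so at most k keyword utterances are missed.
    threshold = pos[k]
    return int(np.sum(neg >= threshold))


def relative_fa_improvement(fa_baseline, fa_proposed):
    """Relative reduction in false accepts with respect to the baseline."""
    return (fa_baseline - fa_proposed) / fa_baseline


# Toy scores (invented for illustration), evaluated at a 25% false reject rate.
fa_base = false_accepts_at_frr([0.2, 0.6, 0.8, 0.9], [0.1, 0.3, 0.65, 0.7, 0.5], 0.25)
fa_new = false_accepts_at_frr([0.3, 0.7, 0.8, 0.95], [0.1, 0.2, 0.75, 0.4, 0.5], 0.25)
print(fa_base, fa_new, relative_fa_improvement(fa_base, fa_new))
```

Each model gets its own threshold, chosen so that both operate at the same false reject rate; only then are the false accept counts compared. Under this definition, a drop from 1000 to 763 false accepts would correspond to a 23.7% relative improvement.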
Related papers
- Exploring Representation Learning For Small-footprint Keyword Spotting (2023) · 3.58
- Keyword Transformer: A Self-attention Model For Keyword Spotting (2021) · 15.31
- A Low Latency Attention Module For Streaming Self-supervised Speech Representation Learning (2023) · 0.00
- Application Of Knowledge Distillation To Multi-task Speech Representation Learning (2022) · 2.26
- Self-supervised Rewiring Of Pre-trained Speech Encoders: Towards Faster Fine-tuning With Less Labels In Speech Processing (2022) · 3.58
- Learning Efficient Representations For Keyword Spotting With Triplet Loss (2021) · 11.76
- Contrastive Augmentation: An Unsupervised Learning Approach For Keyword Spotting In Speech Technology (2024) · 9.92
- Transformers With Convolutional Context For ASR (2019) · 0.00