Performance-efficiency Trade-offs In Unsupervised Pre-training For Speech Recognition
2021 · Felix Wu, Kwangyoun Kim, Jing Pan, et al.
Abstract
This paper is a study of performance-efficiency trade-offs in pre-trained models for automatic speech recognition (ASR). We focus on wav2vec 2.0, and formalize several architecture designs that influence both the model performance and its efficiency. Putting together all our observations, we introduce SEW (Squeezed and Efficient Wav2vec), a pre-trained model architecture with significant improvements along both performance and efficiency dimensions across a variety of training setups. For example, under the 100h-960h semi-supervised setup on LibriSpeech, SEW achieves a 1.9x inference speedup compared to wav2vec 2.0, with a 13.5% relative reduction in word error rate. With a similar inference time, SEW reduces word error rate by 25-50% across different model sizes.
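The two headline metrics in the abstract, relative word error rate (WER) reduction and inference speedup, are simple ratios against a baseline. The sketch below shows how such figures are typically computed. The function names and the input values are assumptions made for illustration: the inputs are hypothetical numbers chosen only to produce ratios of the same size as those quoted, and are not measurements from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the WER reduction as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


def inference_speedup(baseline_seconds: float, new_seconds: float) -> float:
    """Return how many times faster the new model is than the baseline."""
    if new_seconds <= 0:
        raise ValueError("new_seconds must be positive")
    return baseline_seconds / new_seconds


# Hypothetical inputs (NOT taken from the paper), picked for illustration only.
baseline_wer, new_wer = 10.0, 8.65        # WER in percent
baseline_time, new_time = 19.0, 10.0      # inference time in seconds

print(f"Relative WER reduction: {relative_wer_reduction(baseline_wer, new_wer):.1%}")
print(f"Inference speedup: {inference_speedup(baseline_time, new_time):.1f}x")
```

A relative reduction is measured against the baseline WER, so it is not the same as an absolute difference in percentage points.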
Related papers
- Wav2vec-s: Semi-supervised Pre-training For Low-resource ASR (2021)
- A Noise-robust Self-supervised Pre-training Model Based Speech Representation Learning For Automatic Speech Recognition (2022)
- Wav2vec: Unsupervised Pre-training For Speech Recognition (2019)
- Self-supervised Rewiring Of Pre-trained Speech Encoders: Towards Faster Fine-tuning With Less Labels In Speech Processing (2022)
- Wav2vec 2.0: A Framework For Self-supervised Learning Of Speech Representations (2020)
- Efficient Infusion Of Self-supervised Representations In Automatic Speech Recognition (2024)
- Improving Low-resource Speech Recognition With Pretrained Speech Models: Continued Pretraining Vs. Semi-supervised Training (2022)
- Self-training And Pre-training Are Complementary For Speech Recognition (2020)