On Scaling Contrastive Representations For Low-resource Speech Recognition
2021 Β· Lasse Borgholt, Tycho Max Sylvester Tax, Jakob Drachmann Havtorn, et al.
Abstract
Recent advances in self-supervised learning through contrastive training have shown that it is possible to learn a competitive speech recognition system with as little as 10 minutes of labeled data. However, these systems are computationally expensive since they require pre-training followed by fine-tuning in a large parameter space. We explore the performance of such systems without fine-tuning by training a state-of-the-art speech recognizer on the fixed representations from the computationally demanding wav2vec 2.0 framework. We find performance to decrease without fine-tuning and, in the extreme low-resource setting, wav2vec 2.0 is inferior to its predecessor. In addition, we find that wav2vec 2.0 representations live in a low dimensional subspace and that decorrelating the features of the representations can stabilize training of the automatic speech recognizer. Finally, we propose a bidirectional extension to the original wav2vec framework that consistently improves performance.
Authors
(none)
Tags
Stats
Related papers
- Wav2vec 2.0: A Framework For Self-supervised Learning Of Speech Representations (2020)0.00
- Ccc-wav2vec 2.0: Clustering Aided Cross Contrastive Self-supervised Learning Of Speech Representations (2022)7.81
- A Noise-robust Self-supervised Pre-training Model Based Speech Representation Learning For Automatic Speech Recognition (2022)11.19
- Wav2vec: Unsupervised Pre-training For Speech Recognition (2019)0.00
- Self-supervised Rewiring Of Pre-trained Speech Encoders: Towards Faster Fine-tuning With Less Labels In Speech Processing (2022)3.58
- Unsupervised Speech Recognition (2021)0.00
- Wav2vec-switch: Contrastive Learning From Original-noisy Speech Pairs For Robust Speech Recognition (2021)12.93
- Whisper Turns Stronger: Augmenting Wav2vec 2.0 For Superior ASR In Low-resource Languages (2024)0.00