Unsupervised Speech Recognition
2021 · Alexei Baevski, Wei-Ning Hsu, Alexis Conneau, et al.
Abstract
Despite rapid progress in the recent past, current speech recognition systems still require labeled training data, which limits this technology to a small fraction of the languages spoken around the globe. This paper describes wav2vec-U, short for wav2vec Unsupervised, a method to train speech recognition models without any labeled data. We leverage self-supervised speech representations to segment unlabeled audio and learn a mapping from these representations to phonemes via adversarial training. The right representations are key to the success of our method. Compared to the best previous unsupervised work, wav2vec-U reduces the phoneme error rate on the TIMIT benchmark from 26.1 to 11.3. On the larger English Librispeech benchmark, wav2vec-U achieves a word error rate of 5.9 on test-other, rivaling some of the best published systems trained on 960 hours of labeled data from only two years ago. We also experiment on nine other languages, including low-resource languages such as Kyrgyz,
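The phoneme and word error rates quoted in the abstract are conventionally computed as the edit distance between a reference and a hypothesis sequence, divided by the reference length. The sketch below illustrates that standard definition only; it is not the paper's evaluation code, and the function names and the toy phoneme sequences are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref, hyp):
    """Error rate in percent: edit distance over reference length."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


# Hypothetical phoneme sequences, used only to exercise the functions.
ref = "sil dh ax k ae t sil".split()
hyp = "sil dh ax k ah t".split()
print(f"PER: {error_rate(ref, hyp):.1f}")

# Relative reduction implied by the TIMIT figures quoted in the abstract.
relative_reduction = 100.0 * (26.1 - 11.3) / 26.1
print(f"Relative PER reduction: {relative_reduction:.1f}%")
```

The same functions give a word error rate when the sequences are lists of words instead of phonemes.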
Related papers
- Wav2vec 2.0: A Framework For Self-supervised Learning Of Speech Representations (2020)
- Wav2vec: Unsupervised Pre-training For Speech Recognition (2019)
- Towards Unsupervised Speech Recognition Without Pronunciation Models (2024)
- Vq-wav2vec: Self-supervised Learning Of Discrete Speech Representations (2019)
- Self-training And Pre-training Are Complementary For Speech Recognition (2020)
- On Scaling Contrastive Representations For Low-resource Speech Recognition (2021)
- A Noise-robust Self-supervised Pre-training Model Based Speech Representation Learning For Automatic Speech Recognition (2022)
- Analyzing The Robustness Of Unsupervised Speech Recognition (2021)