BigSSL: Exploring The Frontier Of Large-scale Semi-supervised Learning For Automatic Speech Recognition
2021 · Yu Zhang, Daniel S. Park, Wei Han, et al.
Abstract
We summarize the results of a host of efforts using giant automatic speech recognition (ASR) models pre-trained using large, diverse unlabeled datasets containing approximately a million hours of audio. We find that the combination of pre-training, self-training and scaling up model size greatly increases data efficiency, even for extremely large tasks with tens of thousands of hours of labeled data. In particular, on an ASR task with 34k hours of labeled data, by fine-tuning an 8 billion parameter pre-trained Conformer model we can match state-of-the-art (SoTA) performance with only 3% of the training data and significantly improve SoTA with the full training set. We also report on the universal benefits gained from using big pre-trained and self-trained models for a large set of downstream tasks that cover a wide range of speech domains and span multiple orders of magnitude of dataset sizes, including obtaining SoTA performance on many public benchmarks. In addition, we utilize the
Related papers
- Analyzing The Factors Affecting Usefulness Of Self-supervised Pre-trained Representations For Speech Recognition (2022) · 0.00
- Deploying Self-supervised Learning In The Wild For Hybrid Automatic Speech Recognition (2022) · 0.00
- Large Language Model Guided Decoding For Self-supervised Speech Recognition (2025) · 0.00
- Fine-tuning Strategies For Faster Inference Using Speech Self-supervised Models: A Comparative Study (2023) · 8.35
- Toward Domain-invariant Speech Recognition Via Large Scale Training (2018) · 13.39
- Towards Supervised Performance On Speaker Verification With Self-supervised Learning By Leveraging Large-scale ASR Models (2024) · 7.50
- Unsupervised Automatic Speech Recognition: A Review (2021) · 13.50
- Large-scale Self-supervised Speech Representation Learning For Automatic Speaker Verification (2021) · 15.25