Less Is More: Accurate Speech Recognition & Translation Without Web-scale Data
2024 · Krishna C. Puvvada, Piotr Żelasko, He Huang, et al.
Abstract
Recent advances in speech recognition and translation rely on hundreds of thousands of hours of Internet speech data. We argue that state-of-the-art accuracy can be reached without relying on web-scale data. Canary, a multilingual ASR and speech translation model, outperforms current state-of-the-art models (Whisper, OWSM, and Seamless-M4T) on English, French, Spanish, and German, while being trained on an order of magnitude less data than these models. Three key factors enable such a data-efficient model: (1) a FastConformer-based attention encoder-decoder architecture, (2) training on synthetic data generated with machine translation, and (3) advanced training techniques: data balancing, dynamic data blending, dynamic bucketing, and noise-robust fine-tuning. The model, weights, and training code will be open-sourced.
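The abstract names data balancing and dynamic bucketing without describing how they work. The sketch below illustrates one common, generic form of each. It is not the Canary training code, and every name and value in it (`language_sampling_weights`, `duration_buckets`, the temperature, the duration budget) is an assumption made for illustration. Balancing is shown as temperature-scaled sampling over per-language hours, and bucketing as grouping utterances of similar duration under a padded-duration budget.

```python
from typing import Dict, List, Sequence


def language_sampling_weights(
    hours: Dict[str, float], temperature: float = 2.0
) -> Dict[str, float]:
    """Temperature-scaled sampling weights (illustrative, not Canary's recipe).

    Each language gets weight proportional to (share of hours) ** (1 / temperature).
    temperature = 1 keeps the natural distribution; larger values upsample
    languages with less data.
    """
    total = sum(hours.values())
    scaled = {lang: (h / total) ** (1.0 / temperature) for lang, h in hours.items()}
    norm = sum(scaled.values())
    return {lang: s / norm for lang, s in scaled.items()}


def duration_buckets(
    durations: Sequence[float], max_batch_seconds: float
) -> List[List[int]]:
    """Group utterance indices into batches of similar duration.

    Utterances are sorted by duration, then packed greedily so that the padded
    cost of a batch (longest utterance * batch size) stays within the budget.
    An utterance longer than the budget still forms its own batch.
    """
    order = sorted(range(len(durations)), key=lambda i: durations[i])
    batches: List[List[int]] = []
    current: List[int] = []
    longest = 0.0
    for i in order:
        longest_if_added = max(longest, durations[i])
        if current and longest_if_added * (len(current) + 1) > max_batch_seconds:
            batches.append(current)          # close the full batch
            current, longest_if_added = [], durations[i]
        current.append(i)
        longest = longest_if_added
    if current:
        batches.append(current)
    return batches


# Hypothetical corpus sizes in hours, chosen only for the example.
weights = language_sampling_weights({"en": 900.0, "de": 100.0}, temperature=2.0)
batches = duration_buckets([1.0, 9.0, 2.0, 10.0, 1.5], max_batch_seconds=20.0)
```

With these made-up numbers, German rises from 10% of the hours to a 25% sampling share, and the short and long utterances land in separate batches, so little of each batch is padding.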