GigaSpeech 2: An Evolving, Large-scale And Multi-domain ASR Corpus For Low-resource Languages With Automated Crawling, Transcription And Refinement
2024 · Yifan Yang, Zheshu Song, Jianheng Zhuo, et al.
Abstract
The evolution of speech technology has been spurred by the rapid increase in dataset sizes. Traditional speech models generally depend on a large amount of labeled training data, which is scarce for low-resource languages. This paper presents GigaSpeech 2, a large-scale, multi-domain, multilingual speech recognition corpus. It is designed for low-resource languages and does not rely on paired speech and text data. GigaSpeech 2 comprises about 30,000 hours of automatically transcribed speech, including Thai, Indonesian, and Vietnamese, gathered from unlabeled YouTube videos. We also introduce an automated pipeline for data crawling, transcription, and label refinement. Specifically, this pipeline involves Whisper for initial transcription, MMS for forced alignment, and multi-dimensional filtering for data quality assurance. A modified Noisy Student Training is developed to further refine flawed pseudo labels iteratively, thereby enhancing model performance. Experimental results on our m
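The multi-dimensional filtering step named in the abstract can be illustrated with a minimal sketch. The abstract does not list the actual filtering dimensions or thresholds, so everything below is an assumption made for illustration: the `Segment` fields, the `keep_segment` and `filter_segments` helpers, and all threshold values are hypothetical and are not taken from the GigaSpeech 2 pipeline.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One pseudo-labeled utterance; every field is a hypothetical stand-in."""
    text: str               # pseudo label, e.g. produced by an ASR model
    duration_s: float       # segment length in seconds
    align_score: float      # forced-alignment confidence in [0, 1]
    lang_confidence: float  # language-ID confidence in [0, 1]


def keep_segment(seg, min_dur=1.0, max_dur=30.0, min_align=0.5, min_lang=0.8):
    """Return True only if the segment passes every check.

    All thresholds are illustrative defaults, not values from the paper.
    """
    if not seg.text.strip():
        return False  # empty transcript
    if not (min_dur <= seg.duration_s <= max_dur):
        return False  # too short or too long
    if seg.align_score < min_align:
        return False  # audio and text are poorly aligned
    if seg.lang_confidence < min_lang:
        return False  # probably not the target language
    return True


def filter_segments(segments, **thresholds):
    """Keep only the segments that pass all filtering dimensions."""
    return [s for s in segments if keep_segment(s, **thresholds)]


segments = [
    Segment("xin chào", 2.4, 0.91, 0.99),  # passes all checks
    Segment("", 3.0, 0.95, 0.99),          # rejected: empty transcript
    Segment("halo", 0.3, 0.90, 0.97),      # rejected: too short
    Segment("sawasdee", 5.0, 0.20, 0.95),  # rejected: low alignment score
]
kept = filter_segments(segments)
print([s.text for s in kept])
```

Each check is independent, so a segment is dropped as soon as it fails any one dimension. In an iterative refinement loop such as Noisy Student Training, a filter of this kind would typically be reapplied after each round of relabeling.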
Related papers
- MSR-86K: An Evolving, Multilingual Corpus With 86,300 Hours Of Transcribed Audio For Speech Recognition Research (2024) · 4.52
- Google Crowdsourced Speech Corpora And Related Open-source Resources For Low-resource Languages And Dialects: An Overview (2020) · 0.00
- Generative Adversarial Training Data Adaptation For Very Low-resource Automatic Speech Recognition (2020) · 6.77
- The Greek Podcast Corpus: Competitive Speech Models For Low-resourced Languages With Weakly Supervised Data (2024) · 0.00
- IndicVoices-R: Unlocking A Massive Multilingual Multi-speaker Speech Corpus For Scaling Indian TTS (2024) · 2.26
- JTubeSpeech: Corpus Of Japanese Speech Collected From YouTube For Speech Recognition And Speaker Verification (2021) · 0.00
- BigSSL: Exploring The Frontier Of Large-scale Semi-supervised Learning For Automatic Speech Recognition (2021) · 15.73
- Frustratingly Easy Data Augmentation For Low-resource ASR (2025) · 0.00