Whistle: Data-efficient Multilingual And Crosslingual Speech Recognition Via Weakly Phonetic Supervision
2024 · Saierdaer Yusuyin, Te Ma, Hao Huang, et al.
Abstract
There exist three approaches for multilingual and crosslingual automatic speech recognition (MCL-ASR): supervised pretraining with phonetic transcription, supervised pretraining with graphemic transcription, and self-supervised pretraining. We find that pretraining with phonetic supervision has been underappreciated so far for MCL-ASR, even though it is conceptually more advantageous for information sharing between different languages. This paper explores pretraining with weakly phonetic supervision towards data-efficient MCL-ASR, an approach called Whistle. We relax the requirement of gold-standard human-validated phonetic transcripts, and obtain International Phonetic Alphabet (IPA) based transcription by leveraging the LanguageNet grapheme-to-phoneme (G2P) models. We construct a common experimental setup based on the CommonVoice dataset, called CV-Lang10, with 10 seen languages and 2 unseen languages. A set of experiments is conducted on CV-Lang10 to compare, as fairly as possible, the three approaches under the co
Related papers
- Low-resourced Speech Recognition For Iu Mien Language Via Weakly-supervised Phoneme-based Multilingual Pre-training (2024)
- Investigating Zero-shot Generalizability On Mandarin-English Code-switched ASR And Speech-to-text Translation Of Recent Foundation Models With Self-supervision And Weak Supervision (2023)
- Multilingual DistilWhisper: Efficient Distillation Of Multi-task Speech Models Via Language-specific Experts (2023)
- M2R-Whisper: Multi-stage And Multi-scale Retrieval Augmentation For Enhancing Whisper (2024)
- XLST: Cross-lingual Self-training To Learn Multilingual Representation For Low Resource Speech Recognition (2021)
- Weighted Cross-entropy For Low-resource Languages In Multilingual Speech Recognition (2024)
- Exploiting Cross-lingual Speaker And Phonetic Diversity For Unsupervised Subword Modeling (2019)
- From Weak Labels To Strong Results: Utilizing 5,000 Hours Of Noisy Classroom Transcripts With Minimal Accurate Data (2025)