SwitchLingua: The First Large-Scale Multilingual and Multi-Ethnic Code-Switching Dataset
2025 · Peng Xie, Xingyuan Liu, Tsz Wai Chan, et al.
Abstract
Code-switching (CS) is the alternating use of two or more languages within a conversation or utterance, often influenced by social context and speaker identity. This linguistic phenomenon poses challenges for Automatic Speech Recognition (ASR) systems, which are typically designed for a single language and struggle to handle multilingual inputs. The growing global demand for multilingual applications, including Code-Switching ASR (CSASR), Text-to-Speech (CSTTS), and Cross-Lingual Information Retrieval (CLIR), highlights the inadequacy of existing monolingual datasets. Although some code-switching datasets exist, most are limited to bilingual mixing within homogeneous ethnic groups, leaving a critical need for a large-scale, diverse benchmark akin to ImageNet in computer vision. To bridge this gap, we introduce LinguaMaster, a multi-agent collaboration framework specifically designed for efficient and scalable multilingual data synthesis. Leveraging this framework, we cur
Related papers
- Code-Switching Speech Recognition Under the Lens: Model- and Data-Centric Perspectives (2025)
- Enhancing Code-Switched Text-to-Speech Synthesis Capability in Large Language Models with Only Monolingual Corpora (2024)
- Language-Agnostic Code-Switching in Sequence-to-Sequence Speech Recognition (2022)
- Unified Model for Code-Switching Speech Recognition and Language Identification Based on a Concatenated Tokenizer (2023)
- The ASRU 2019 Mandarin-English Code-Switching Speech Recognition Challenge: Open Datasets, Tracks, Methods and Results (2020)
- Developing a Multilingual Dataset and Evaluation Metrics for Code-Switching: A Focus on Hong Kong's Polylingual Dynamics (2023)
- Integrating Knowledge in End-to-End Automatic Speech Recognition for Mandarin-English Code-Switching (2021)
- Exploring Retraining-Free Speech Recognition for Intra-Sentential Code-Switching (2021)