MERLIon CCS Challenge: A English-Mandarin Code-switching Child-directed Speech Corpus For Language Identification And Diarization
2023 · Victoria Y. H. Chua, Hexin Liu, Leibny Paola Garcia Perera, et al.
Abstract
To enhance the reliability and robustness of language identification (LID) and language diarization (LD) systems for heterogeneous populations and scenarios, speech processing models need to be trained on datasets that feature diverse language registers and speech patterns. We present the MERLIon CCS challenge, featuring a first-of-its-kind Zoom video call dataset of parent-child shared book reading, comprising over 30 hours of audio across more than 300 recordings, annotated by multilingual transcribers using a high-fidelity linguistic transcription protocol. The audio corpus features spontaneous, in-the-wild English-Mandarin code-switching and child-directed speech in non-standard accents, with diverse language-mixing patterns, recorded in a variety of home environments. This report describes the corpus, as well as LID and LD results for our baseline and several systems submitted to the MERLIon CCS challenge using the corpus.