Open Source MagicData-RAMC: A Rich Annotated Mandarin Conversational (RAMC) Speech Dataset
2022 · Zehui Yang, Yifan Chen, Lei Luo, et al.
Abstract
This paper introduces a high-quality rich annotated Mandarin conversational (RAMC) speech dataset called MagicData-RAMC. The MagicData-RAMC corpus contains 180 hours of conversational speech data recorded from native speakers of Mandarin Chinese over mobile phones with a sampling rate of 16 kHz. The dialogs in MagicData-RAMC are classified into 15 diversified domains and tagged with topic labels, ranging from science and technology to ordinary life. Accurate transcription and precise speaker voice activity timestamps are manually labeled for each sample. Speakers' detailed information is also provided. As a Mandarin speech dataset designed for dialog scenarios with high quality and rich annotations, MagicData-RAMC enriches the data diversity in the Mandarin speech community and allows extensive research on a series of speech-related tasks, including automatic speech recognition, speaker diarization, topic detection, keyword search, text-to-speech, etc. We also conduct several relevant
Related papers
- MAVD: The First Open Large-scale Mandarin Audio-visual Dataset With Depth Information (2023) 6.22
- MSceneSpeech: A Multi-scene Speech Dataset For Expressive Speech Synthesis (2024) 0.00
- AS-70: A Mandarin Stuttered Speech Dataset For Automatic Speech Recognition And Stuttering Event Detection (2024) 0.00
- Audio Caption: Listen And Tell (2019) 10.97
- WenetSpeech: A 10000+ Hours Multi-domain Mandarin Corpus For Speech Recognition (2021) 16.12
- SD-Eval: A Benchmark Dataset For Spoken Dialogue Understanding Beyond Words (2024) 11.32
- MERLIon CCS Challenge: A English-Mandarin Code-switching Child-directed Speech Corpus For Language Identification And Diarization (2023) 0.00
- LibriheavyMix: A 20,000-hour Dataset For Single-channel Reverberant Multi-talker Speech Separation, ASR And Speaker Diarization (2024) 5.24