WenetSpeech: A 10000+ Hours Multi-domain Mandarin Corpus for Speech Recognition
2021 · Binbin Zhang, Hang Lv, Pengcheng Guo, et al.
Abstract
In this paper, we present WenetSpeech, a multi-domain Mandarin corpus consisting of 10000+ hours of high-quality labeled speech, 2400+ hours of weakly labeled speech, and about 10000 hours of unlabeled speech, with 22400+ hours in total. We collect the data from YouTube and Podcast, which covers a variety of speaking styles, scenarios, domains, topics, and noisy conditions. An optical character recognition (OCR) based method is introduced to generate the audio/text segmentation candidates for the YouTube data from its corresponding video captions, while a high-quality ASR transcription system is used to generate audio/text pair candidates for the Podcast data. Then we propose a novel end-to-end label error detection approach to further validate and filter the candidates. We also provide three manually labeled high-quality test sets along with WenetSpeech for evaluation: Dev, for cross-validation purposes in training; Test_Net, collected from the Internet for matched test; and Test_Meeting, recorded