WenetSpeech4TTS: A 12,800-hour Mandarin TTS Corpus For Large Speech Generation Model Benchmark
2024 · Linhan Ma, Dake Guo, Kun Song, et al.
Abstract
With the development of large text-to-speech (TTS) models and the scale-up of training data, state-of-the-art TTS systems have achieved impressive performance. In this paper, we present WenetSpeech4TTS, a multi-domain Mandarin corpus derived from the open-sourced WenetSpeech dataset. To tailor it for text-to-speech tasks, we refined WenetSpeech by adjusting segment boundaries, enhancing the audio quality, and eliminating speaker mixing within each segment. Following a more accurate transcription process and a quality-based data filtering process, the resulting WenetSpeech4TTS corpus contains 12,800 hours of paired audio-text data. Furthermore, we have created subsets of varying sizes, categorized by segment quality scores, to allow for TTS model training and fine-tuning. VALL-E and NaturalSpeech 2 systems are trained and fine-tuned on these subsets to validate the usability of WenetSpeech4TTS, establishing baselines on the benchmark for fair comparison of TTS systems. The corpus and corres
Related papers
- WenetSpeech: A 10000+ Hours Multi-domain Mandarin Corpus For Speech Recognition (2021)
- AISHELL-3: A Multi-speaker Mandarin TTS Corpus And The Baselines (2020)
- BASE TTS: Lessons From Building A Billion-parameter Text-to-speech Model On 100K Hours Of Data (2024)
- Text-to-speech Synthesis In The Wild (2024)
- Towards Natural Bilingual And Code-switched Speech Synthesis Based On Mix Of Monolingual Recordings And Cross-lingual Voice Conversion (2020)
- MnTTS2: An Open-source Multi-speaker Mongolian Text-to-speech Synthesis Dataset (2022)
- MnTTS: An Open-source Mongolian Text-to-speech Synthesis Dataset And Accompanied Baseline (2022)
- Bailing-TTS: Chinese Dialectal Speech Synthesis Towards Human-like Spontaneous Representation (2024)