Bailing-TTS: Chinese Dialectal Speech Synthesis Towards Human-like Spontaneous Representation
2024 · Xinhan Di, Zihao Chen, Yunming Liang, et al.
Abstract
Large-scale text-to-speech (TTS) models have made significant progress recently. However, they still fall short in the generation of Chinese dialectal speech. To address this, we propose Bailing-TTS, a family of large-scale TTS models capable of generating high-quality Chinese dialectal speech. Bailing-TTS serves as a foundation model for Chinese dialectal speech generation. First, continual semi-supervised learning is proposed to facilitate the alignment of text tokens and speech tokens. Second, Chinese dialectal representation learning is developed using a specific transformer architecture and multi-stage training processes. With the proposed network architecture and the corresponding training strategy, Bailing-TTS is able to generate Chinese dialectal speech from text effectively and efficiently. Experiments demonstrate that Bailing-TTS generates Chinese dialectal speech towards human-like spontaneous representation. Readers are encouraged to listen to demos at https://c9412600
Related papers
- Towards Natural Bilingual And Code-switched Speech Synthesis Based On Mix Of Monolingual Recordings And Cross-lingual Voice Conversion (2020) 0.00
- BreezyVoice: Adapting TTS For Taiwanese Mandarin With Enhanced Polyphone Disambiguation -- Challenges And Insights (2025) 0.00
- Building A Mixed-lingual Neural TTS System With Only Monolingual Data (2019) 0.00
- TMD-TTS: A Unified Tibetan Multi-dialect Text-to-speech Framework For Ü-Tsang, Amdo And Kham Speech Dataset Generation (2025) 0.00
- Cross-dialect Text-to-speech In Pitch-accent Language Incorporating Multi-dialect Phoneme-level BERT (2024) 3.58
- TDASS: Target Domain Adaptation Speech Synthesis Framework For Multi-speaker Low-resource TTS (2022) 0.00
- Spontaneous Style Text-to-speech Synthesis With Controllable Spontaneous Behaviors Based On Language Models (2024) 7.81
- WenetSpeech4TTS: A 12,800-hour Mandarin TTS Corpus For Large Speech Generation Model Benchmark (2024) 9.76