Towards Spontaneous Style Modeling With Semi-supervised Pre-training For Conversational Text-to-speech Synthesis
2023 · Weiqin Li, Shun Lei, Qiaochu Huang, et al.
Abstract
The spontaneous behavior that often occurs in conversations makes speech sound more human-like than reading-style speech. However, synthesizing spontaneous-style speech is challenging due to the lack of high-quality spontaneous datasets and the high cost of labeling spontaneous behavior. In this paper, we propose a semi-supervised pre-training method to increase the amount of spontaneous-style speech and spontaneous behavior labels. During semi-supervised learning, both text and speech information are considered when detecting spontaneous behavior labels in speech. Moreover, a linguistic-aware encoder is used to model the relationship between the sentences in a conversation. Experimental results indicate that our proposed method achieves superior expressive speech synthesis performance, with the ability to model spontaneous behavior in spontaneous-style speech and to predict reasonable spontaneous behavior from text.
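The abstract does not spell out how the extra behavior labels are produced. As a rough illustration only, the sketch below shows generic confidence-thresholded pseudo-labeling, one common way to grow a label set in semi-supervised learning. It is not the authors' implementation: the label names, the `toy_detector` scoring rule, the single speech feature, and the 0.9 threshold are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

# One unlabeled item: a text token plus speech features for the same span.
# The pairing is an assumption that mirrors "both text and speech information".
Example = Tuple[str, Sequence[float]]

# Hypothetical label set; the abstract does not list the behaviors.
LABELS = ["none", "filled_pause"]


def toy_detector(example: Example) -> List[float]:
    """Stand-in detector returning one probability per entry in LABELS."""
    text, speech = example
    text_score = 1.0 if text in {"um", "uh"} else 0.0  # text cue
    speech_score = min(speech[0], 1.0)                 # e.g. normalized duration
    p_pause = 0.5 * text_score + 0.5 * speech_score
    return [1.0 - p_pause, p_pause]


def pseudo_label(
    detector: Callable[[Example], List[float]],
    unlabeled: Sequence[Example],
    threshold: float = 0.9,  # assumed confidence cutoff
) -> List[Tuple[Example, str]]:
    """Keep only predictions whose top probability reaches the threshold."""
    selected = []
    for example in unlabeled:
        probs = detector(example)
        best = max(range(len(probs)), key=probs.__getitem__)
        if probs[best] >= threshold:
            selected.append((example, LABELS[best]))
    return selected


unlabeled_data = [("um", [0.9]), ("hello", [0.1]), ("so", [0.8])]
pseudo_labeled = pseudo_label(toy_detector, unlabeled_data)
```

In this toy run the ambiguous token `"so"` is discarded because neither label is confident enough, while the other two items receive pseudo-labels that could be added to the training set for a later training round.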
Related papers
- Spontaneous Style Text-to-speech Synthesis With Controllable Spontaneous Behaviors Based On Language Models (2024) · 7.81
- SponTTS: Modeling And Transferring Spontaneous Style For TTS (2023) · 7.50
- Self-supervised Context-aware Style Representation For Expressive Speech Synthesis (2022) · 6.34
- End-to-end Text-to-speech Based On Latent Representation Of Speaking Styles Using Spontaneous Dialogue (2022) · 8.35
- Semi-supervised Generative Modeling For Controllable Speech Synthesis (2019) · 0.00
- Natural Language Guidance Of High-fidelity Text-to-speech With Synthetic Annotations (2024) · 0.00
- Semi-supervised Learning For Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation (2020) · 5.24
- Prosody-controllable Spontaneous TTS With Neural HMMs (2022) · 8.09