SponTTS: Modeling and Transferring Spontaneous Style for TTS
2023 · Hanzhao Li, Xinfa Zhu, Liumeng Xue, et al.
Abstract
Spontaneous speaking style exhibits notable differences from other speaking styles due to various spontaneous phenomena (e.g., filled pauses, prolongation) and substantial prosody variation (e.g., diverse pitch and duration variation, occasional non-verbal speech like a smile), posing challenges to modeling and prediction of spontaneous style. Moreover, the limitation of high-quality spontaneous data constrains spontaneous speech generation for speakers without spontaneous data. To address these problems, we propose SponTTS, a two-stage approach based on neural bottleneck (BN) features to model and transfer spontaneous style for TTS. In the first stage, we adopt a Conditional Variational Autoencoder (CVAE) to capture spontaneous prosody from a BN feature and involve the spontaneous phenomena by the constraint of spontaneous phenomena embedding prediction loss. Besides, we introduce a flow-based predictor to predict a latent spontaneous style representation from the text, which enriches
Related papers
- Spontaneous Style Text-to-speech Synthesis With Controllable Spontaneous Behaviors Based On Language Models (2024) [7.81]
- Towards Spontaneous Style Modeling With Semi-supervised Pre-training For Conversational Text-to-speech Synthesis (2023) [4.52]
- StyleTTS: A Style-based Generative Model For Natural And Diverse Text-to-speech Synthesis (2022) [10.97]
- End-to-end Text-to-speech Based On Latent Representation Of Speaking Styles Using Spontaneous Dialogue (2022) [8.35]
- Prosody-controllable Spontaneous TTS With Neural HMMs (2022) [8.09]
- InstructTTS: Modelling Expressive TTS In Discrete Latent Space With Natural Language Style Prompt (2023) [0.00]
- Cross-speaker Style Transfer With Prosody Bottleneck In Neural Speech Synthesis (2021) [10.21]
- Text-driven Emotional Style Control And Cross-speaker Style Transfer In Neural TTS (2022) [7.81]