Spontaneous Style Text-to-speech Synthesis With Controllable Spontaneous Behaviors Based On Language Models
2024 · Weiqin Li, Peiji Yang, Yicheng Zhong, et al.
Abstract
Spontaneous style speech synthesis, which aims to generate human-like speech, often encounters challenges due to the scarcity of high-quality data and limitations in model capabilities. Recent language model-based TTS systems can be trained on large, diverse, and low-quality speech datasets, resulting in highly natural synthesized speech. However, they are limited by the difficulty of simulating various spontaneous behaviors and capturing prosody variations in spontaneous speech. In this paper, we propose a novel spontaneous speech synthesis system based on language models. We systematically categorize and uniformly model diverse spontaneous behaviors. Moreover, fine-grained prosody modeling is introduced to enhance the model's ability to capture subtle prosody variations in spontaneous speech. Experimental results show that our proposed method significantly outperforms the baseline methods in terms of prosody naturalness and spontaneous behavior naturalness.
Related papers
- Towards Spontaneous Style Modeling With Semi-supervised Pre-training For Conversational Text-to-speech Synthesis (2023) · 4.52
- SponTTS: Modeling And Transferring Spontaneous Style For TTS (2023) · 7.50
- StyleTTS: A Style-based Generative Model For Natural And Diverse Text-to-speech Synthesis (2022) · 10.97
- Prosody-controllable Spontaneous TTS With Neural HMMs (2022) · 8.09
- End-to-end Text-to-speech Based On Latent Representation Of Speaking Styles Using Spontaneous Dialogue (2022) · 8.35
- InstructTTS: Modelling Expressive TTS In Discrete Latent Space With Natural Language Style Prompt (2023) · 0.00
- Expressive TTS Driven By Natural Language Prompts Using Few Human Annotations (2023) · 0.00
- Style-Talker: Finetuning Audio Language Model And Style-based Text-to-speech Model For Fast Spoken Dialogue Generation (2024) · 0.00