Comedicspeech: Text To Speech For Stand-up Comedies In Low-resource Scenarios
2023 Β· Yuyue Wang, Huan Xiao, Yihan Wu, et al.
Abstract
Text to Speech (TTS) models can generate natural and high-quality speech, but it is not expressive enough when synthesizing speech with dramatic expressiveness, such as stand-up comedies. Considering comedians have diverse personal speech styles, including personal prosody, rhythm, and fillers, it requires real-world datasets and strong speech style modeling capabilities, which brings challenges. In this paper, we construct a new dataset and develop ComedicSpeech, a TTS system tailored for the stand-up comedy synthesis in low-resource scenarios. First, we extract prosody representation by the prosody encoder and condition it to the TTS model in a flexible way. Second, we enhance the personal rhythm modeling by a conditional duration predictor. Third, we model the personal fillers by introducing comedian-related special tokens. Experiments show that ComedicSpeech achieves better expressiveness than baselines with only ten-minute training data for each comedian. The audio samples are ava
Authors
(none)
Tags
Stats
Related papers
- Low-resource Expressive Text-to-speech Using Data Augmentation (2020)11.29
- Non-autoregressive TTS With Explicit Duration Modelling For Low-resource Highly Expressive Speech (2021)8.82
- Styletts: A Style-based Generative Model For Natural And Diverse Text-to-speech Synthesis (2022)10.97
- Instructtts: Modelling Expressive TTS In Discrete Latent Space With Natural Language Style Prompt (2023)0.00
- Contextspeech: Expressive And Efficient Text-to-speech For Paragraph Reading (2023)5.84
- Spontaneous Style Text-to-speech Synthesis With Controllable Spontaneous Behaviors Based On Language Models (2024)7.81
- Textrolspeech: A Text Style Control Speech Corpus With Codec Language Text-to-speech Models (2023)9.59
- Stable-tts: Stable Speaker-adaptive Text-to-speech Synthesis Via Prosody Prompting (2024)4.52