TextrolSpeech: A Text Style Control Speech Corpus With Codec Language Text-to-Speech Models
2023 · Shengpeng Ji, Jialong Zuo, Minghui Fang, et al.
Abstract
Recently, there has been growing interest in controllable Text-to-Speech (TTS). While previous studies have relied on users providing specific style factor values based on acoustic knowledge or selecting reference speech that meets certain requirements, generating speech solely from natural text prompts has emerged as a new challenge for researchers. This challenge arises from the scarcity of high-quality speech datasets with natural text style prompts and the absence of advanced text-controllable TTS models. In light of this, 1) we propose TextrolSpeech, which is the first large-scale speech emotion dataset annotated with rich text attributes. The dataset comprises 236,220 pairs of style prompts, written as natural text descriptions covering five style factors, and corresponding speech samples. Through iterative experimentation, we introduce a multi-stage prompt programming approach that effectively utilizes the GPT model for generating natural style descriptions in large volumes. 2
Related papers
- InstructTTS: Modelling Expressive TTS In Discrete Latent Space With Natural Language Style Prompt (2023)
- LibriTTS-P: A Corpus With Speaking Style And Speaker Identity Prompts For Text-to-speech And Style Captioning (2024)
- PromptStyle: Controllable Style Transfer For Text-to-speech With Natural Language Descriptions (2023)
- StyleTTS: A Style-based Generative Model For Natural And Diverse Text-to-speech Synthesis (2022)
- MM-TTS: Multi-modal Prompt Based Style Transfer For Expressive Text-to-speech Synthesis (2023)
- Expressive TTS Driven By Natural Language Prompts Using Few Human Annotations (2023)
- Text-driven Emotional Style Control And Cross-speaker Style Transfer In Neural TTS (2022)
- Expressive Text-to-speech Using Style Tag (2021)