Expressive TTS Driven By Natural Language Prompts Using Few Human Annotations
2023 · Hanglei Zhang, Yiwei Guo, Sen Liu, et al.
Abstract
Expressive text-to-speech (TTS) aims to synthesize speech with human-like tones, moods, or even artistic attributes. Recent advancements in expressive TTS empower users with the ability to directly control synthesis style through natural language prompts. However, these methods often require excessive training with a significant amount of style-annotated data, which can be challenging to acquire. Moreover, they may have limited adaptability due to fixed style annotations. In this work, we present FreeStyleTTS (FS-TTS), a controllable expressive TTS model with minimal human annotations. Our approach utilizes a large language model (LLM) to transform expressive TTS into a style retrieval task. The LLM selects the best-matching style references from annotated utterances based on external style prompts, which can be raw input text or natural language style descriptions. The selected reference guides the TTS pipeline to synthesize speech with the intended style. This innovative approach
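The retrieval framing described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the paper uses an LLM to pick the reference, whereas the sketch below swaps in a simple token-overlap (Jaccard) scorer as a stand-in so the example runs on its own. The names `AnnotatedUtterance`, `select_reference`, `overlap_score`, the file paths, and the annotation strings are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Set


@dataclass(frozen=True)
class AnnotatedUtterance:
    audio_path: str        # hypothetical path to a reference recording
    style_annotation: str  # short human-written style description


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and strip basic punctuation from each word."""
    tokens = {word.strip(".,!?;:").lower() for word in text.split()}
    tokens.discard("")
    return tokens


def overlap_score(prompt: str, annotation: str) -> float:
    """Jaccard similarity of token sets.

    Assumption: this is only a placeholder for the LLM's matching
    judgment, which the paper does not reduce to word overlap.
    """
    a, b = tokenize(prompt), tokenize(annotation)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def select_reference(
    prompt: str,
    pool: Sequence[AnnotatedUtterance],
    scorer: Callable[[str, str], float] = overlap_score,
) -> AnnotatedUtterance:
    """Return the annotated utterance whose style best matches the prompt."""
    if not pool:
        raise ValueError("the annotated pool must not be empty")
    return max(pool, key=lambda u: scorer(prompt, u.style_annotation))


# A tiny, made-up annotated pool standing in for the "few human annotations".
POOL = [
    AnnotatedUtterance("ref/happy_01.wav", "cheerful and excited, fast pace"),
    AnnotatedUtterance("ref/sad_01.wav", "sad and quiet, slow pace"),
    AnnotatedUtterance("ref/angry_01.wav", "angry and loud, sharp tone"),
]

chosen = select_reference("speak in a sad, quiet voice", POOL)
print(chosen.audio_path)
```

In a pipeline along the lines the abstract describes, the selected utterance would then be passed to the TTS model as the style reference. The `scorer` argument is where an LLM-based matcher would be plugged in place of the overlap heuristic.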
Related papers
- InstructTTS: Modelling Expressive TTS In Discrete Latent Space With Natural Language Style Prompt (2023)
- Expressive Text-to-speech Using Style Tag (2021)
- StyleTTS: A Style-based Generative Model For Natural And Diverse Text-to-speech Synthesis (2022)
- Self-supervised Context-aware Style Representation For Expressive Speech Synthesis (2022)
- StyleSpeech: Parameter-efficient Fine Tuning For Pre-trained Controllable Text-to-speech (2024)
- StyleFusion TTS: Multimodal Style-control And Enhanced Feature Fusion For Zero-shot Text-to-speech Synthesis (2024)
- Spontaneous Style Text-to-speech Synthesis With Controllable Spontaneous Behaviors Based On Language Models (2024)
- Low-resource Expressive Text-to-speech Using Data Augmentation (2020)