Natural Language Guidance Of High-fidelity Text-to-speech With Synthetic Annotations
2024 · Dan Lyth, Simon King
Abstract
Text-to-speech models trained on large-scale datasets have demonstrated impressive in-context learning capabilities and naturalness. However, control of speaker identity and style in these models typically requires conditioning on reference speech recordings, limiting creative applications. Alternatively, natural language prompting of speaker identity and style has demonstrated promising results and provides an intuitive method of control. However, reliance on human-labeled descriptions prevents scaling to large datasets. Our work bridges the gap between these two approaches. We propose a scalable method for labeling various aspects of speaker identity, style, and recording conditions. We then apply this method to a 45k hour dataset, which we use to train a speech language model. Furthermore, we propose simple methods for increasing audio fidelity, significantly outperforming recent work despite relying entirely on found data. Our results demonstrate high-fidelity speech generation in
Related papers
- Expressive TTS Driven By Natural Language Prompts Using Few Human Annotations (2023)
- Spontaneous Style Text-to-speech Synthesis With Controllable Spontaneous Behaviors Based On Language Models (2024)
- Controllable Generation Of Artificial Speaker Embeddings Through Discovery Of Principal Directions (2023)
- StyleTTS: A Style-based Generative Model For Natural And Diverse Text-to-speech Synthesis (2022)
- Self-supervised Context-aware Style Representation For Expressive Speech Synthesis (2022)
- Low-resource Expressive Text-to-speech Using Data Augmentation (2020)
- PromptTTS++: Controlling Speaker Identity In Prompt-based Text-to-speech Using Natural Language Descriptions (2023)
- InstructTTS: Modelling Expressive TTS In Discrete Latent Space With Natural Language Style Prompt (2023)