Towards Controllable Speech Synthesis In The Era Of Large Language Models: A Systematic Survey
2024 Β· Tianxin Xie, Yan Rong, Pengfei Zhang, et al.
Abstract
Text-to-speech (TTS) has advanced from generating natural-sounding speech to enabling fine-grained control over attributes like emotion, timbre, and style. Driven by rising industrial demand and breakthroughs in deep learning, e.g., diffusion and large language models (LLMs), controllable TTS has become a rapidly growing research area. This survey provides the first comprehensive review of controllable TTS methods, from traditional control techniques to emerging approaches using natural language prompts. We categorize model architectures, control strategies, and feature representations, while also summarizing challenges, datasets, and evaluations in controllable TTS. This survey aims to guide researchers and practitioners by offering a clear taxonomy and highlighting future directions in this fast-evolving field. One can visit https://github.com/imxtx/awesome-controllabe-speech-synthesis for a comprehensive paper list and updates.
Authors
(none)
Tags
Stats
Code
Related papers
- Evaluating Text-to-speech Synthesis From A Large Discrete Token-based Speech Language Model (2024)0.00
- Recent Advances In Speech Language Models: A Survey (2024)14.64
- Stylespeech: Parameter-efficient Fine Tuning For Pre-trained Controllable Text-to-speech (2024)6.34
- A Methodology For Controlling The Emotional Expressiveness In Synthetic Speech -- A Deep Learning Approach (2019)5.84
- Analysis And Assessment Of Controllability Of An Expressive Deep Learning-based TTS System (2021)5.24
- Visualization And Interpretation Of Latent Spaces For Controlling Expressive Speech Synthesis Through Audio Analysis (2019)10.07
- Spontaneous Style Text-to-speech Synthesis With Controllable Spontaneous Behaviors Based On Language Models (2024)7.81
- PROEMO: Prompt-driven Text-to-speech Synthesis Based On Emotion And Intensity Control (2025)0.00