PROEMO: Prompt-driven Text-to-speech Synthesis Based On Emotion And Intensity Control
2025 · Shaozuo Zhang, Ambuj Mehrish, Yingting Li, et al.
Abstract
Speech synthesis has advanced significantly from statistical methods to deep neural network architectures, leading to various text-to-speech (TTS) models that closely mimic human speech patterns. However, capturing nuances such as emotion and style in synthesized speech remains challenging. To address this challenge, we introduce an approach centered on prompt-based emotion control. The proposed architecture incorporates emotion and intensity control across multiple speakers. Furthermore, we leverage large language models (LLMs) to manipulate speech prosody while preserving linguistic content. By embedding emotional cues, regulating intensity levels, and guiding prosodic variations with prompts, our approach infuses synthesized speech with human-like expressiveness and variability. Lastly, we demonstrate the effectiveness of our approach through a systematic exploration of these three control mechanisms.
Related papers
- Expressive Prompting: Improving Emotion Intensity And Speaker Consistency In Zero-shot TTS (2024) 0.00
- A Methodology For Controlling The Emotional Expressiveness In Synthetic Speech -- A Deep Learning Approach (2019) 5.84
- MsEmoTTS: Multi-scale Emotion Transfer, Prediction, And Control For Emotional Speech Synthesis (2022) 13.97
- Emo-DPO: Controllable Emotional Speech Synthesis Through Direct Preference Optimization (2024) 9.59
- EmoMix: Emotion Mixing Via Diffusion Models For Emotional Speech Synthesis (2023) 0.00
- UMETTS: A Unified Framework For Emotional Text-to-speech Synthesis With Multimodal Prompts (2024) 5.24
- Emotional Dimension Control In Language Model-based Text-to-speech: Spanning A Broad Spectrum Of Human Emotions (2024) 0.00
- EmoSphere-TTS: Emotional Style And Intensity Modeling Via Spherical Emotion Vector For Controllable Emotional Text-to-speech (2024) 10.35