EMORL-TTS: Reinforcement Learning For Fine-grained Emotion Control In LLM-based TTS
2025 · Haoxun Li, Yu Liu, Yuqing Sun, et al.
Abstract
Recent LLM-based TTS systems achieve strong quality and zero-shot ability, but lack fine-grained emotional control due to their reliance on discrete speech tokens. Existing approaches either limit emotions to categorical labels or cannot generalize to LLM-based architectures. We propose EMORL-TTS (Fine-grained Emotion-controllable TTS with Reinforcement Learning), a framework that unifies global intensity control in the valence-arousal-dominance (VAD) space with local emphasis regulation. Our method combines supervised fine-tuning with reinforcement learning guided by task-specific rewards for emotion category, intensity, and emphasis. We also investigate how emphasis placement modulates fine-grained emotion intensity. Experiments show that EMORL-TTS improves emotion accuracy, intensity differentiation, and emphasis clarity, while preserving synthesis quality comparable to strong LLM-based baselines.
Related papers
- Fine-grained Emotional Control Of Text-to-speech: Learning To Rank Inter- And Intra-class Emotion Intensities (2023)
- Reinforcement Learning For Emotional Text-to-speech Synthesis With Improved Emotion Discriminability (2021)
- PROEMO: Prompt-driven Text-to-speech Synthesis Based On Emotion And Intensity Control (2025)
- Efficient Emotion And Speaker Adaptation In LLM-based TTS Via Characteristic-specific Partial Fine-tuning (2025)
- Emotional Dimension Control In Language Model-based Text-to-speech: Spanning A Broad Spectrum Of Human Emotions (2024)
- EmoSphere-TTS: Emotional Style And Intensity Modeling Via Spherical Emotion Vector For Controllable Emotional Text-to-speech (2024)
- Emo-DPO: Controllable Emotional Speech Synthesis Through Direct Preference Optimization (2024)
- A Methodology For Controlling The Emotional Expressiveness In Synthetic Speech: A Deep Learning Approach (2019)