Fine-grained Emotional Control Of Text-to-speech: Learning To Rank Inter- And Intra-class Emotion Intensities
2023 · Shijun Wang, Jón Guðnason, Damian Borth
Abstract
State-of-the-art Text-To-Speech (TTS) models are capable of producing high-quality speech. The generated speech, however, is usually neutral in emotional expression, whereas one often wants fine-grained emotional control at the level of words or phonemes. Although this remains challenging, the first TTS models that can control the voice through manually assigned emotion intensity have recently been proposed. Unfortunately, because they neglect intra-class distance, the resulting intensity differences are often unrecognizable. In this paper, we propose a fine-grained controllable emotional TTS that considers both inter- and intra-class distances and is able to synthesize speech with recognizable intensity differences. Our subjective and objective experiments demonstrate that our model exceeds two state-of-the-art controllable TTS models in controllability, emotion expressiveness, and naturalness.
Related papers
- Emotional Speech Synthesis With Rich And Granularized Control (2019)
- Emotional Dimension Control In Language Model-based Text-to-speech: Spanning A Broad Spectrum Of Human Emotions (2024)
- A Methodology For Controlling The Emotional Expressiveness In Synthetic Speech -- A Deep Learning Approach (2019)
- EMORL-TTS: Reinforcement Learning For Fine-grained Emotion Control In LLM-based TTS (2025)
- EmoSphere-TTS: Emotional Style And Intensity Modeling Via Spherical Emotion Vector For Controllable Emotional Text-to-speech (2024)
- RSET: Remapping-based Sorting Method For Emotion Transfer Speech Synthesis (2024)
- EmoDiff: Intensity Controllable Emotional Text-to-speech With Soft-label Guidance (2022)
- Reinforcement Learning For Emotional Text-to-speech Synthesis With Improved Emotion Discriminability (2021)