Reinforcement Learning For Emotional Text-to-speech Synthesis With Improved Emotion Discriminability
2021 · Rui Liu, Berrak Sisman, Haizhou Li
Abstract
Emotional text-to-speech synthesis (ETTS) has seen much progress in recent years. However, the generated voice is often not perceptually identifiable as its intended emotion category. To address this problem, we propose a new interactive training paradigm for ETTS, denoted i-ETTS, which seeks to directly improve emotion discriminability by interacting with a speech emotion recognition (SER) model. Moreover, we formulate an iterative training strategy with reinforcement learning to ensure the quality of i-ETTS optimization. Experimental results demonstrate that the proposed i-ETTS outperforms state-of-the-art baselines by rendering speech with a more accurate emotion style. To the best of our knowledge, this is the first study of reinforcement learning in emotional text-to-speech synthesis.
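The abstract does not say which reinforcement learning algorithm or reward the authors use, so the following is only a toy sketch of the general idea: a policy is rewarded when a recognizer identifies the intended emotion. It assumes a REINFORCE-style policy gradient with a running-mean baseline. The three "rendering styles", the `SER_PROB_OF_TARGET` numbers, and the `train` function are hypothetical stand-ins, not the paper's models or data.

```python
import math
import random

# Hypothetical stand-in for an SER model: for each candidate rendering style,
# the probability that the recognizer outputs the intended emotion.
# These numbers are made up for illustration.
SER_PROB_OF_TARGET = [0.2, 0.9, 0.4]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train(steps=2000, lr=0.1, seed=0):
    """Toy REINFORCE loop: the policy picks a style, the simulated SER rewards it."""
    rng = random.Random(seed)
    logits = [0.0] * len(SER_PROB_OF_TARGET)  # policy parameters
    baseline = 0.0  # running mean reward, used to reduce gradient variance
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(len(probs)), weights=probs)[0]
        # Reward is 1 if the simulated SER recognizes the intended emotion.
        reward = 1.0 if rng.random() < SER_PROB_OF_TARGET[action] else 0.0
        advantage = reward - baseline
        baseline += 0.05 * (reward - baseline)
        # Gradient of log softmax w.r.t. logit k is (1[k == action] - probs[k]).
        for k in range(len(logits)):
            grad = (1.0 if k == action else 0.0) - probs[k]
            logits[k] += lr * advantage * grad
    return softmax(logits)


final_probs = train()
print(final_probs)
```

In this toy setting the policy should shift probability toward the style that the simulated recognizer identifies most reliably. A real ETTS system would replace the three-way choice with a speech synthesizer and the lookup table with a trained SER model.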
Related papers
- EMORL-TTS: Reinforcement Learning For Fine-grained Emotion Control In LLM-based TTS (2025)
- Fine-grained Emotional Control Of Text-to-speech: Learning To Rank Inter- And Intra-class Emotion Intensities (2023)
- Semi-supervised Learning For Continuous Emotional Intensity Controllable Speech Synthesis With Disentangled Representations (2022)
- A Methodology For Controlling The Emotional Expressiveness In Synthetic Speech -- A Deep Learning Approach (2019)
- ED-TTS: Multi-scale Emotion Modeling Using Cross-domain Emotion Diarization For Emotional Speech Synthesis (2024)
- RSET: Remapping-based Sorting Method For Emotion Transfer Speech Synthesis (2024)
- Exploring Speech Style Spaces With Language Models: Emotional TTS Without Emotion Labels (2024)
- Generative Emotional AI For Speech Emotion Recognition: The Case For Synthetic Emotional Speech Augmentation (2023)