Emosphere-tts: Emotional Style And Intensity Modeling Via Spherical Emotion Vector For Controllable Emotional Text-to-speech
2024 Β· Deok-Hyeon Cho, Hyung-Seok Oh, Seung-Bin Kim, et al.
Abstract
Despite rapid advances in the field of emotional text-to-speech (TTS), recent studies primarily focus on mimicking the average style of a particular emotion. As a result, the ability to manipulate speech emotion remains constrained to several predefined labels, compromising the ability to reflect the nuanced variations of emotion. In this paper, we propose EmoSphere-TTS, which synthesizes expressive emotional speech by using a spherical emotion vector to control the emotional style and intensity of the synthetic speech. Without any human annotation, we use the arousal, valence, and dominance pseudo-labels to model the complex nature of emotion via a Cartesian-spherical transformation. Furthermore, we propose a dual conditional adversarial network to improve the quality of generated speech by reflecting the multi-aspect characteristics. The experimental results demonstrate the model ability to control emotional style and intensity with high-quality expressive speech.
Authors
(none)
Tags
Stats
Related papers
- Emotional Dimension Control In Language Model-based Text-to-speech: Spanning A Broad Spectrum Of Human Emotions (2024)0.00
- Exploring Speech Style Spaces With Language Models: Emotional TTS Without Emotion Labels (2024)0.00
- Msemotts: Multi-scale Emotion Transfer, Prediction, And Control For Emotional Speech Synthesis (2022)13.97
- Fine-grained Emotional Control Of Text-to-speech: Learning To Rank Inter- And Intra-class Emotion Intensities (2023)6.77
- A Methodology For Controlling The Emotional Expressiveness In Synthetic Speech -- A Deep Learning Approach (2019)5.84
- PROEMO: Prompt-driven Text-to-speech Synthesis Based On Emotion And Intensity Control (2025)0.00
- Text-driven Emotional Style Control And Cross-speaker Style Transfer In Neural TTS (2022)7.81
- Emospeech: A Corpus Of Emotionally Rich And Contextually Detailed Speech Annotations (2024)0.00