ED-TTS: Multi-scale Emotion Modeling Using Cross-domain Emotion Diarization For Emotional Speech Synthesis
2024 · Haobin Tang, Xulong Zhang, Ning Cheng, et al.
Abstract
Existing emotional speech synthesis methods often utilize an utterance-level style embedding extracted from reference audio, neglecting the inherent multi-scale property of speech prosody. We introduce ED-TTS, a multi-scale emotional speech synthesis model that leverages Speech Emotion Diarization (SED) and Speech Emotion Recognition (SER) to model emotions at different levels. Specifically, our proposed approach integrates the utterance-level emotion embedding extracted by SER with fine-grained frame-level emotion embedding obtained from SED. These embeddings are used to condition the reverse process of the denoising diffusion probabilistic model (DDPM). Additionally, we employ cross-domain SED to accurately predict soft labels, addressing the challenge of a scarcity of fine-grained emotion-annotated datasets for supervising emotional TTS training.
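The conditioning idea in the abstract can be illustrated with a minimal NumPy sketch. It is a toy under stated assumptions, not the authors' implementation: the abstract does not say how the two embeddings are fused, so broadcast addition is assumed here; `toy_noise_predictor` is a hypothetical linear stand-in for the learned denoiser; and all sizes and the noise schedule are illustrative. Only the reverse-step update itself follows the standard DDPM sampling formula.

```python
import numpy as np

def combine_emotion_embeddings(utt_emb, frame_emb):
    """Fuse the two scales (assumed fusion: broadcast addition).

    utt_emb:   (D,)   utterance-level embedding, e.g. from an SER model
    frame_emb: (T, D) frame-level embedding, e.g. from an SED model
    returns:   (T, D) per-frame conditioning sequence
    """
    return frame_emb + utt_emb[None, :]

def toy_noise_predictor(x_t, cond, w):
    """Hypothetical stand-in for the denoiser: linear map of [x_t, cond]."""
    return np.concatenate([x_t, cond], axis=-1) @ w

def ddpm_reverse_step(x_t, t, cond, betas, w, rng):
    """One standard DDPM reverse step, conditioned on `cond`."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    eps = toy_noise_predictor(x_t, cond, w)
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
    if t == 0:
        return mean  # no noise is added at the final step
    return mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)

rng = np.random.default_rng(0)
T, D, M = 6, 4, 8                      # frames, embedding size, mel bins (illustrative)
utt_emb = rng.standard_normal(D)       # placeholder for an SER embedding
frame_emb = rng.standard_normal((T, D))  # placeholder for SED embeddings
cond = combine_emotion_embeddings(utt_emb, frame_emb)

betas = np.linspace(1e-4, 0.05, 10)    # illustrative noise schedule
w = 0.01 * rng.standard_normal((M + D, M))  # untrained toy weights
x = rng.standard_normal((T, M))        # start from Gaussian noise
for t in reversed(range(len(betas))):
    x = ddpm_reverse_step(x, t, cond, betas, w, rng)
print(x.shape)
```

Because the conditioning sequence has one vector per frame, the utterance-level term shifts every frame the same way while the frame-level term can vary over time, which is the multi-scale behavior the abstract describes.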
Related papers
- Emotion Controllable Speech Synthesis Using Emotion-unlabeled Dataset With The Assistance Of Cross-domain Speech Emotion Recognition (2020) · 12.93
- MsEmoTTS: Multi-scale Emotion Transfer, Prediction, And Control For Emotional Speech Synthesis (2022) · 13.97
- METTS: Multilingual Emotional Text-to-speech By Cross-speaker And Cross-lingual Emotion Transfer (2023) · 0.00
- Daisy-TTS: Simulating Wider Spectrum Of Emotions Via Prosody Embedding Decomposition (2024) · 0.00
- Boosting Multi-speaker Expressive Speech Synthesis With Semi-supervised Contrastive Learning (2023) · 5.24
- SpeechEQ: Speech Emotion Recognition Based On Multi-scale Unified Datasets And Multitask Learning (2022) · 5.84
- Reinforcement Learning For Emotional Text-to-speech Synthesis With Improved Emotion Discriminability (2021) · 0.00
- Generative Emotional AI For Speech Emotion Recognition: The Case For Synthetic Emotional Speech Augmentation (2023) · 11.19