UMETTS: A Unified Framework For Emotional Text-to-speech Synthesis With Multimodal Prompts
2024 · Zhi-Qi Cheng, Xiang Li, Jun-Yan He, et al.
Abstract
Emotional Text-to-Speech (E-TTS) synthesis has garnered significant attention in recent years due to its potential to revolutionize human-computer interaction. However, current E-TTS approaches often struggle to capture the intricacies of human emotions, primarily relying on oversimplified emotional labels or single-modality input. In this paper, we introduce the Unified Multimodal Prompt-Induced Emotional Text-to-Speech System (UMETTS), a novel framework that leverages emotional cues from multiple modalities to generate highly expressive and emotionally resonant speech. The core of UMETTS consists of two key components: the Emotion Prompt Alignment Module (EP-Align) and the Emotion Embedding-Induced TTS Module (EMI-TTS). (1) EP-Align employs contrastive learning to align emotional features across text, audio, and visual modalities, ensuring a coherent fusion of multimodal information. (2) Subsequently, EMI-TTS integrates the aligned emotional embeddings with state-of-the-art TTS model
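The abstract says EP-Align uses contrastive learning to align emotional features across modalities, but it does not give the loss. As a rough illustration only, the sketch below assumes a symmetric InfoNCE objective (the kind used in CLIP-style alignment) over a batch of paired embeddings from two modalities. The function names, the temperature value, and the toy embeddings are all hypothetical and are not taken from the paper.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def contrastive_alignment_loss(emb_a, emb_b, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of paired embeddings.

    emb_a[i] and emb_b[i] are assumed to describe the same emotional prompt
    in two modalities (e.g. text and audio); all other pairs in the batch
    act as negatives. This is an assumed objective, not the paper's.
    """
    a = [l2_normalize(v) for v in emb_a]
    b = [l2_normalize(v) for v in emb_b]
    # Cosine similarity of every a-row with every b-row, scaled by temperature.
    logits = [[sum(x * y for x, y in zip(u, w)) / temperature for w in b]
              for u in a]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the a-to-b and b-to-a directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))

# Hypothetical toy embeddings: two "text" prompts and two "audio" prompts.
text_emb = [[1.0, 0.0], [0.0, 1.0]]
audio_aligned = [[0.9, 0.1], [0.1, 0.9]]   # pairs match row by row
audio_swapped = [[0.1, 0.9], [0.9, 0.1]]   # pairs are mismatched

loss_aligned = contrastive_alignment_loss(text_emb, audio_aligned)
loss_swapped = contrastive_alignment_loss(text_emb, audio_swapped)
print(loss_aligned < loss_swapped)
```

Under these assumptions, correctly paired embeddings yield a lower loss than mismatched ones, which is the property a contrastive alignment module would optimize for. Extending the sketch to three modalities would mean summing the loss over each pair of modalities, though whether the paper does so is not stated in the abstract.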
Related papers
- METTS: Multilingual Emotional Text-to-speech By Cross-speaker And Cross-lingual Emotion Transfer (2023)
- MsEmoTTS: Multi-scale Emotion Transfer, Prediction, And Control For Emotional Speech Synthesis (2022)
- PROEMO: Prompt-driven Text-to-speech Synthesis Based On Emotion And Intensity Control (2025)
- MM-TTS: Multi-modal Prompt Based Style Transfer For Expressive Text-to-speech Synthesis (2023)
- EE-TTS: Emphatic Expressive TTS With Linguistic Information (2023)
- Emotional Dimension Control In Language Model-based Text-to-speech: Spanning A Broad Spectrum Of Human Emotions (2024)
- EmoMix: Emotion Mixing Via Diffusion Models For Emotional Speech Synthesis (2023)
- Exploring Speech Style Spaces With Language Models: Emotional TTS Without Emotion Labels (2024)