Intra- And Inter-modal Context Interaction Modeling For Conversational Speech Synthesis
2024 · Zhenqi Jia, Rui Liu
Abstract
Conversational Speech Synthesis (CSS) aims to use the multimodal dialogue history (MDH) effectively to generate speech with appropriate conversational prosody for the target utterance. The key challenge of CSS is to model the interaction between the MDH and the target utterance. The text and speech modalities in the MDH each have their own influence, and they complement each other to produce a combined effect on the target utterance. Previous works did not explicitly model such intra-modal and inter-modal interactions. To address this issue, we propose a new CSS system based on an intra-modal and inter-modal context interaction scheme, termed III-CSS. Specifically, in the training phase, we combine the MDH with the text and speech modalities of the target utterance to obtain four modal combinations: Historical Text-Next Text, Historical Speech-Next Speech, Historical Text-Next Speech, and Historical Speech-Next Text. Then, we design two contrastive learning-based intra-modal
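As an illustration of the four modal combinations named above, the following minimal sketch enumerates them and pairs them with a generic InfoNCE-style contrastive loss. The loss, the function names, the temperature value, and the toy vectors are all assumptions made for illustration; the abstract does not specify the authors' actual modules, so this should not be read as their implementation.

```python
import math
from itertools import product

# Hypothetical identifiers: the abstract names the combinations but no code-level API.
HISTORY_MODALITIES = ("Historical Text", "Historical Speech")
NEXT_MODALITIES = ("Next Text", "Next Speech")


def modal_combinations():
    """Enumerate the four history/next pairings, labelled intra- or inter-modal."""
    combos = []
    for hist, nxt in product(HISTORY_MODALITIES, NEXT_MODALITIES):
        # A pairing is intra-modal when both sides share a modality (Text or Speech).
        same = hist.split()[-1] == nxt.split()[-1]
        combos.append((hist, nxt, "intra-modal" if same else "inter-modal"))
    return combos


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(history, nexts, temperature=0.1):
    """Generic InfoNCE loss (an assumed stand-in, not the paper's loss).

    history[i] is treated as the positive match for nexts[i]; every other
    nexts[j] in the batch acts as a negative.
    """
    losses = []
    for i, h in enumerate(history):
        logits = [cosine(h, n) / temperature for n in nexts]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


if __name__ == "__main__":
    for hist, nxt, kind in modal_combinations():
        print(f"{hist}-{nxt}: {kind}")
    # Toy embeddings: two dialogues, two-dimensional vectors.
    history_emb = [[1.0, 0.0], [0.0, 1.0]]
    next_emb = [[1.0, 0.0], [0.0, 1.0]]
    print(info_nce(history_emb, next_emb))
```

In this sketch the same loss function would be applied once per combination, with the history and next-utterance embeddings drawn from the corresponding modalities; how the losses are weighted or combined is left open because the source does not say.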