CONCSS: Contrastive-based Context Comprehension For Dialogue-appropriate Prosody In Conversational Speech Synthesis
2023 · Yayue Deng, Jinlong Xue, Yukang Jia, et al.
Abstract
Conversational speech synthesis (CSS) incorporates historical dialogue as supplementary information with the aim of generating speech that has dialogue-appropriate prosody. Although previous methods have worked on improving context comprehension, the resulting context representations still lack expressive power and context-sensitive discriminability. In this paper, we introduce a contrastive learning-based CSS framework, CONCSS. Within this framework, we define a pretext task specific to CSS that lets the model perform self-supervised learning on unlabeled conversational datasets, improving its context understanding. Additionally, we introduce a sampling strategy for negative sample augmentation to make the context vectors more discriminable. This is the first attempt to integrate contrastive learning into CSS. We conduct ablation studies on different contrastive learning strategies and comprehensive experiments in comparison with prior CSS systems.
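The abstract does not state which contrastive objective CONCSS uses. As a rough illustration of the general idea only, the sketch below computes an InfoNCE-style loss, a common choice in contrastive learning, for a single context vector against one matching sample and several negatives. The function names, the temperature value, and the toy vectors are all assumptions made for this example and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor (hypothetical helper, not the paper's code).

    The loss is the negative log-softmax score of the positive sample
    among the positive and all negatives.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denom - logits[0]


# Toy 2-D vectors standing in for context representations (assumed values).
context_vec = [1.0, 0.0]
matching = [0.9, 0.1]
mismatched = [[0.0, 1.0], [-1.0, 0.2]]

# The loss is small when the positive is close to the anchor ...
loss_good = info_nce(context_vec, matching, mismatched)
# ... and large when a mismatched vector is treated as the positive.
loss_bad = info_nce(context_vec, mismatched[0], [matching, mismatched[1]])
```

In this kind of objective, adding more (or harder) negatives enlarges the denominator, which is one plausible reason a negative sample augmentation strategy could sharpen the discriminability of context vectors.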
Related papers
- DiffCSS: Diverse And Expressive Conversational Speech Synthesis With Diffusion Models (2025) 0.00
- Intra- And Inter-modal Context Interaction Modeling For Conversational Speech Synthesis (2024) 4.53
- Emotion Rendering For Conversational Speech Synthesis With Heterogeneous Graph-based Context Modeling (2023) 13.15
- CLAPSpeech: Learning Prosody From Text Context With Contrastive Language-audio Pre-training (2023) 0.00
- Leveraging Real Conversational Data For Multi-channel Continuous Speech Separation (2022) 0.00
- C3-DINO: Joint Contrastive And Non-contrastive Self-supervised Learning For Speaker Verification (2022) 10.21
- FCTalker: Fine And Coarse Grained Context Modeling For Expressive Conversational Speech Synthesis (2022) 2.86
- Continual Contrastive Spoken Language Understanding (2023) 0.00