Enhancing Speaking Styles In Conversational Text-to-speech Synthesis With Graph-based Multi-modal Context Modeling
2021 · Jingbei Li, Yi Meng, Chenyi Li, et al.
Abstract
Compared with traditional text-to-speech (TTS) systems, conversational TTS systems are required to synthesize speech with a proper speaking style that conforms to the conversational context. However, state-of-the-art context modeling methods in conversational TTS only model the textual information in the context with a recurrent neural network (RNN). Such methods have limited ability to model the inter-speaker influence in conversations, and they also neglect the speaking styles and the intra-speaker inertia of each speaker. Inspired by DialogueGCN and its superiority over RNN-based approaches in modeling such conversational influences, we propose a graph-based multi-modal context modeling method and apply it to conversational TTS to enhance the speaking styles of synthesized speech. Both the textual and the speaking style information in the context are extracted and processed by DialogueGCN to model the inter- and intra-speaker influence in conversations. The outputs of DialogueGCN are then
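The abstract describes the context graph only at a high level, so the following is a minimal Python sketch of a DialogueGCN-style graph under stated assumptions, not the authors' implementation. Each utterance is a node whose feature concatenates a textual embedding and a speaking-style embedding (the multi-modal part). Edges link utterances within an assumed past/future window, and each edge's relation type records the source speaker, the target speaker, and the temporal direction, so intra-speaker (same speaker) and inter-speaker influences receive separate weights. The helper names `build_context_graph`, `node_features`, `relational_step` and the parameters `past_window`, `future_window`, `rel_weight` are hypothetical. The aggregation step deliberately simplifies DialogueGCN's relational graph convolution: it uses a scalar weight per relation (W_r = w_r·I), drops the edge attention, and applies a ReLU.

```python
from collections import defaultdict


def build_context_graph(speakers, past_window=2, future_window=2):
    """Return {(target, source): relation} for a DialogueGCN-style graph.

    Hypothetical helper. A relation is (source_speaker, target_speaker,
    direction): equal speakers mean intra-speaker influence, different
    speakers mean inter-speaker influence, and direction says whether the
    source utterance comes before ("past") or after ("future") the target.
    """
    edges = {}
    n = len(speakers)
    for i in range(n):
        lo = max(0, i - past_window)
        hi = min(n - 1, i + future_window)
        for j in range(lo, hi + 1):
            if j == i:
                continue  # the self-connection is handled in relational_step
            direction = "past" if j < i else "future"
            edges[(i, j)] = (speakers[j], speakers[i], direction)
    return edges


def node_features(text_emb, style_emb):
    """Multi-modal node feature: concatenate text and speaking-style vectors."""
    return [list(t) + list(s) for t, s in zip(text_emb, style_emb)]


def relational_step(features, edges, rel_weight, self_weight=1.0):
    """One simplified relational graph-convolution step followed by ReLU.

    Assumptions: W_r = rel_weight[r] * I (a scalar per relation), no edge
    attention, and neighbours sharing a relation are averaged (c_{i,r} = |N_i^r|).
    """
    # target node -> relation -> list of source nodes
    neighbours = defaultdict(lambda: defaultdict(list))
    for (i, j), rel in edges.items():
        neighbours[i][rel].append(j)

    out = []
    for i, g_i in enumerate(features):
        h = [self_weight * x for x in g_i]  # self term W_0 g_i with W_0 = self_weight * I
        for rel, sources in neighbours[i].items():
            w = rel_weight.get(rel, 0.0)
            for j in sources:
                for d, x in enumerate(features[j]):
                    h[d] += w * x / len(sources)
        out.append([max(0.0, x) for x in h])  # ReLU
    return out


if __name__ == "__main__":
    speakers = ["A", "B", "A", "B"]
    edges = build_context_graph(speakers, past_window=2, future_window=0)
    feats = node_features([[1.0], [2.0], [3.0], [4.0]], [[0.0], [1.0], [0.0], [1.0]])
    # Hypothetical weights: intra-speaker 0.5, inter-speaker 0.25.
    weights = {r: (0.5 if r[0] == r[1] else 0.25) for r in set(edges.values())}
    print(relational_step(feats, edges, weights))
    # prints [[1.0, 0.0], [2.25, 1.0], [4.0, 0.25], [5.75, 1.5]]
```

In this toy run, the third utterance (speaker A) aggregates from the first utterance through an intra-speaker relation and from the second (speaker B) through an inter-speaker relation, each with its own weight. The abstract does not say whether the paper's graph also connects future utterances or how its node features are produced, so the window sizes and embeddings here are placeholders.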
Related papers
- MSStyleTTS: Multi-scale Style Modeling With Hierarchical Context Information For Expressive Speech Synthesis (2023) · 6.77
- StyleTTS: A Style-based Generative Model For Natural And Diverse Text-to-speech Synthesis (2022) · 10.97
- HiGNN-TTS: Hierarchical Prosody Modeling With Graph Neural Networks For Expressive Long-form TTS (2023) · 5.84
- FCTalker: Fine And Coarse Grained Context Modeling For Expressive Conversational Speech Synthesis (2022) · 2.86
- Improving The Quality Of Neural TTS Using Long-form Content And Multi-speaker Multi-style Modeling (2022) · 3.58
- Multi-speaker Multi-style Text-to-speech Synthesis With Single-speaker Single-style Training Data Scenarios (2021) · 6.77
- GraphSpeech: Syntax-aware Graph Attention Network For Neural Speech Synthesis (2020) · 7.50
- Emphasis Rendering For Conversational Text-to-speech With Multi-modal Multi-scale Context Modeling (2024) · 0.00