HiGNN-TTS: Hierarchical Prosody Modeling With Graph Neural Networks For Expressive Long-form TTS
2023 · Dake Guo, Xinfa Zhu, Liumeng Xue, et al.
Abstract
Recent advances in text-to-speech, particularly those based on Graph Neural Networks (GNNs), have significantly improved the expressiveness of short-form synthetic speech. However, generating human-parity long-form speech with high dynamic prosodic variations is still challenging. To address this problem, we expand the capabilities of GNNs with a hierarchical prosody modeling approach, named HiGNN-TTS. Specifically, we add a virtual global node in the graph to strengthen the interconnection of word nodes and introduce a contextual attention mechanism to broaden the prosody modeling scope of GNNs from intra-sentence to inter-sentence. Additionally, we perform hierarchical supervision from acoustic prosody on each node of the graph to capture the prosodic variations with a high dynamic range. Ablation studies show the effectiveness of HiGNN-TTS in learning hierarchical prosody. Both objective and subjective evaluations demonstrate that HiGNN-TTS significantly improves the naturalness and
Related papers
- Hierarchical Prosody Modeling And Control In Non-autoregressive Parallel Neural TTS (2021) · 8.35
- Hierarchical Prosody Modeling For Non-autoregressive Speech Synthesis (2020) · 10.07
- GraphSpeech: Syntax-aware Graph Attention Network For Neural Speech Synthesis (2020) · 7.50
- Prosody-controllable Spontaneous TTS With Neural HMMs (2022) · 8.09
- Hierarchical And Multi-scale Variational Autoencoder For Diverse And Natural Non-autoregressive Text-to-speech (2022) · 3.58
- Enhancing Speaking Styles In Conversational Text-to-speech Synthesis With Graph-based Multi-modal Context Modeling (2021) · 0.00
- Towards Expressive Zero-shot Speech Synthesis With Hierarchical Prosody Modeling (2024) · 4.52
- Controllable Neural Text-to-speech Synthesis Using Intuitive Prosodic Features (2020) · 11.76