Exploiting Deep Sentential Context For Expressive End-to-end Speech Synthesis
2020 · Fengyu Yang, Shan Yang, Qinghua Wu, et al.
Abstract
Attention-based seq2seq text-to-speech systems, especially those using self-attention networks (SAN), have achieved state-of-the-art performance. However, an expressive corpus with rich prosody remains challenging to model because 1) prosodic aspects, which span different sentential granularities and largely determine acoustic expressiveness, are difficult to quantize and label, and 2) the current seq2seq framework extracts prosodic information solely from a text encoder, which easily collapses to an averaged expression for expressive content. In this paper, we propose a context extractor, built upon a SAN-based text encoder, to fully exploit the sentential context over an expressive corpus for seq2seq-based TTS. Our context extractor first collects prosody-related sentential context information from different SAN layers and then aggregates it to learn a comprehensive sentence representation that enhances the expressiveness of the final generated speech. Specifically, we invest
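The abstract describes collecting sentence-level context from several SAN layers and aggregating it into one sentence representation. The following is a minimal sketch of that general idea, not the authors' implementation: it mean-pools each layer's token vectors and combines the pooled vectors with softmax weights. The function names (`mean_pool`, `softmax`, `aggregate_layers`), the choice of mean pooling, and the fixed per-layer scores are all assumptions made for illustration; in a trained model the layer weights would presumably be learned.

```python
import math


def mean_pool(layer_output):
    """Average a [tokens x dim] matrix (list of lists) into one dim-sized vector."""
    n_tokens = len(layer_output)
    dim = len(layer_output[0])
    return [sum(tok[d] for tok in layer_output) / n_tokens for d in range(dim)]


def softmax(scores):
    """Numerically stable softmax over a list of scalars."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_layers(layer_outputs, layer_scores):
    """Weighted sum of per-layer sentence vectors.

    layer_outputs: one [tokens x dim] matrix per encoder layer.
    layer_scores:  one scalar per layer (hypothetical; fixed here,
                   learned in a real model).
    """
    weights = softmax(layer_scores)
    pooled = [mean_pool(layer) for layer in layer_outputs]
    dim = len(pooled[0])
    return [sum(w * vec[d] for w, vec in zip(weights, pooled)) for d in range(dim)]


# Toy input: 2 encoder layers, 2 tokens each, hidden size 2.
layers = [
    [[1.0, 0.0], [3.0, 2.0]],  # layer 1, pooled to [2.0, 1.0]
    [[0.0, 4.0], [2.0, 0.0]],  # layer 2, pooled to [1.0, 2.0]
]
sentence_vec = aggregate_layers(layers, [0.0, 0.0])  # equal scores, equal weights
print(sentence_vec)  # [1.5, 1.5]
```

With equal scores every layer contributes equally; raising one layer's score shifts the sentence vector toward that layer's pooled output, which is the behavior a learned weighting would exploit.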
Related papers
- Investigating Context Features Hidden In End-to-end TTS (2018) · 0.00
- Self-supervised Context-aware Style Representation For Expressive Speech Synthesis (2022) · 6.34
- ContextSpeech: Expressive And Efficient Text-to-speech For Paragraph Reading (2023) · 5.84
- Simple And Effective Multi-sentence TTS With Expressive And Coherent Prosody (2022) · 7.16
- FCTalker: Fine And Coarse Grained Context Modeling For Expressive Conversational Speech Synthesis (2022) · 2.86
- Using Previous Acoustic Context To Improve Text-to-speech Synthesis (2020) · 0.00
- Advanced Long-context End-to-end Speech Recognition Using Context-expanded Transformers (2021) · 10.07
- Improving Speech Prosody Of Audiobook Text-to-speech Synthesis With Acoustic And Textual Contexts (2022) · 7.81