Context-aware Coherent Speaking Style Prediction With Hierarchical Transformers For Audiobook Speech Synthesis
2023 · Shun Lei, Yixuan Zhou, Liyang Chen, et al.
Abstract
Recent advances in text-to-speech have significantly improved the expressiveness of synthesized speech. However, it is still challenging to generate speech with a contextually appropriate and coherent speaking style for multi-sentence text in audiobooks. In this paper, we propose a context-aware coherent speaking style prediction method for audiobook speech synthesis. To predict the style embedding of the current utterance, we design a hierarchical transformer-based context-aware style predictor with a mixture attention mask, which considers both text-side context information and speech-side style information from preceding utterances. With this predictor, long-form speech can be generated sentence by sentence with coherent style and prosody. Objective and subjective evaluations on a Mandarin audiobook dataset demonstrate that the proposed model generates speech with a more expressive and coherent speaking style than the baselines, in both single-sentence and multi-sentence tests.
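The sentence-by-sentence procedure described in the abstract can be sketched roughly as below. This is an illustrative outline only, not the authors' implementation: `synthesize_styles`, `toy_predictor`, and the `past_window` parameter are hypothetical names, and the toy predictor is a stand-in for the hierarchical transformer, which the abstract does not specify in detail. The only point it shows is the data flow: each new style embedding depends on the text context and on the style embeddings already produced for earlier utterances.

```python
from typing import Callable, List, Sequence

Vector = List[float]
# Hypothetical predictor signature: (all sentences, current index, earlier styles) -> style
StylePredictor = Callable[[Sequence[str], int, Sequence[Vector]], Vector]


def synthesize_styles(
    sentences: Sequence[str],
    predict_style: StylePredictor,
    past_window: int = 2,  # assumption: how many earlier utterances are visible
) -> List[Vector]:
    """Predict one style embedding per sentence, in reading order."""
    styles: List[Vector] = []
    for index in range(len(sentences)):
        # Speech-side context: styles of the most recent earlier utterances only.
        previous_styles = styles[max(0, index - past_window):index]
        # Text-side context: the predictor sees the surrounding sentences as text.
        styles.append(predict_style(sentences, index, previous_styles))
    return styles


def toy_predictor(
    sentences: Sequence[str], index: int, previous_styles: Sequence[Vector]
) -> Vector:
    """Stand-in for the real model: blends a text feature with past style values."""
    text_feature = float(len(sentences[index].split()))
    if previous_styles:
        history = sum(s[0] for s in previous_styles) / len(previous_styles)
    else:
        history = 0.0
    return [0.5 * text_feature + 0.5 * history]


if __name__ == "__main__":
    demo = ["a b", "c d e f", "g"]
    print(synthesize_styles(demo, toy_predictor))
```

Because each predicted embedding is appended to `styles` before the next sentence is processed, a change in an early utterance's style carries forward to later ones, which is the mechanism the abstract relies on for coherence across sentences.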
Related papers
- MSStyleTTS: Multi-scale Style Modeling With Hierarchical Context Information For Expressive Speech Synthesis (2023) · 6.77
- Towards Expressive Speaking Style Modelling With Hierarchical Context Information For Mandarin Speech Synthesis (2022) · 6.34
- Self-supervised Context-aware Style Representation For Expressive Speech Synthesis (2022) · 6.34
- Improving Prosody For Cross-speaker Style Transfer By Semi-supervised Style Extractor And Hierarchical Modeling In Speech Synthesis (2023) · 7.50
- Hierarchical Prosody Modeling For Non-autoregressive Speech Synthesis (2020) · 10.07
- StyleSpeech: Self-supervised Style Enhancing With VQ-VAE-based Pre-training For Expressive Audiobook Speech Synthesis (2023) · 7.16
- Improving Speech Prosody Of Audiobook Text-to-speech Synthesis With Acoustic And Textual Contexts (2022) · 7.81
- Hierarchical Context-aware Transformers For Non-autoregressive Text To Speech (2021) · 5.24