MSStyleTTS: Multi-scale Style Modeling With Hierarchical Context Information For Expressive Speech Synthesis
2023 · Shun Lei, Yixuan Zhou, Liyang Chen, et al.
Abstract
Expressive speech synthesis is crucial for many human-computer interaction scenarios, such as audiobooks, podcasts, and voice assistants. Previous work focuses on predicting style embeddings at a single scale from the information within the current sentence. However, context information in neighboring sentences and the multi-scale nature of style in human speech are neglected, making it challenging to convert multi-sentence text into natural and expressive speech. In this paper, we propose MSStyleTTS, a style modeling method for expressive speech synthesis that captures and predicts styles at different levels from a wider range of context than a single sentence. Two sub-modules, a multi-scale style extractor and a multi-scale style predictor, are trained together with a FastSpeech 2 based acoustic model. The predictor is designed to explore the hierarchical context information by considering structural relationships in context and predict style embeddings at global-level, sentence-
Related papers
- Context-aware Coherent Speaking Style Prediction With Hierarchical Transformers For Audiobook Speech Synthesis (2023) · 5.24
- Self-supervised Context-aware Style Representation For Expressive Speech Synthesis (2022) · 6.34
- MsEmoTTS: Multi-scale Emotion Transfer, Prediction, And Control For Emotional Speech Synthesis (2022) · 13.97
- StyleTTS: A Style-based Generative Model For Natural And Diverse Text-to-speech Synthesis (2022) · 10.97
- Towards Expressive Speaking Style Modelling With Hierarchical Context Information For Mandarin Speech Synthesis (2022) · 6.34
- Enhancing Speaking Styles In Conversational Text-to-speech Synthesis With Graph-based Multi-modal Context Modeling (2021) · 0.00
- DiffStyleTTS: Diffusion-based Hierarchical Prosody Modeling For Text-to-speech With Diverse And Controllable Styles (2024) · 0.00
- MM-TTS: Multi-modal Prompt Based Style Transfer For Expressive Text-to-speech Synthesis (2023) · 8.60