Phone-level Prosody Modelling With GMM-based MDN For Diverse And Controllable Speech Synthesis
2021 · Chenpeng Du, Kai Yu
Abstract
Generating natural speech with a diverse and smooth prosody pattern is a challenging task. Although random sampling from a phone-level prosody distribution has been investigated as a way to generate different prosody patterns, the diversity of the generated speech is still very limited and far from what humans can achieve. This is largely due to the use of uni-modal distributions, such as a single Gaussian, in prior work on phone-level prosody modelling. In this work, we propose a novel approach that models phone-level prosodies with a GMM-based mixture density network (MDN) and then extends it to multi-speaker TTS using speaker adaptation transforms of the Gaussian means and variances. Furthermore, we show that we can clone the prosodies of a reference speech by sampling prosodies from the Gaussian components that produce the reference prosodies. Our experiments on the LJSpeech and LibriTTS datasets show that the proposed method with GMM-based MDN not only achieves significantly better diversit
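The sampling and cloning ideas described in the abstract can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the GMM parameters would normally be predicted per phone by the MDN, whereas here they are plain arrays; the covariances are assumed diagonal; and "the components that produce the reference prosodies" is interpreted as the component with the highest posterior for each reference vector. All function names are hypothetical.

```python
import numpy as np


def sample_gmm(weights, means, stds, rng):
    """Draw one prosody vector per phone from a diagonal-covariance GMM.

    weights: (T, K) mixture weights, each row summing to 1
    means, stds: (T, K, D) per-component means and standard deviations
    Returns the samples (T, D) and the chosen component indices (T,).
    """
    T, K = weights.shape
    # Choose one component per phone according to its mixture weights.
    comps = np.array([rng.choice(K, p=weights[t]) for t in range(T)])
    return sample_from_components(comps, means, stds, rng), comps


def most_likely_components(ref, weights, means, stds):
    """Assumed cloning step: per phone, pick the component with the
    highest posterior probability for the reference vector ref (T, D)."""
    diff = (ref[:, None, :] - means) / stds                  # (T, K, D)
    log_pdf = -0.5 * np.sum(
        diff ** 2 + 2.0 * np.log(stds) + np.log(2.0 * np.pi), axis=-1
    )                                                        # (T, K)
    return np.argmax(np.log(weights) + log_pdf, axis=-1)     # (T,)


def sample_from_components(comps, means, stds, rng):
    """Sample each phone's vector from its selected Gaussian component."""
    idx = np.arange(len(comps))
    noise = rng.standard_normal((len(comps), means.shape[-1]))
    return means[idx, comps] + stds[idx, comps] * noise
```

Calling `sample_gmm` repeatedly gives different prosody sequences for the same text, while `most_likely_components` followed by `sample_from_components` keeps each phone in the mode selected by the reference and varies it only within that component.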
Related papers
- Rich Prosody Diversity Modelling With Phone-level Mixture Density Network (2021) 8.35
- DPP-TTS: Diversifying Prosodic Features Of Speech Via Determinantal Point Processes (2023) 0.00
- Prosody-controllable Spontaneous TTS With Neural HMMs (2022) 8.09
- Using Generative Modelling To Produce Varied Intonation For Speech Synthesis (2019) 7.81
- DiffProsody: Diffusion-based Latent Prosody Generation For Expressive Speech Synthesis With Prosody Conditional Adversarial Training (2023) 10.07
- Unsupervised Word-level Prosody Tagging For Controllable Speech Synthesis (2022) 7.16
- DiffStyleTTS: Diffusion-based Hierarchical Prosody Modeling For Text-to-speech With Diverse And Controllable Styles (2024) 0.00
- Msdtron: A High-capability Multi-speaker Speech Synthesis System For Diverse Data Using Characteristic Information (2021) 4.52