Learning Utterance-level Representations Through Token-level Acoustic Latents Prediction For Expressive Speech Synthesis
2022 · Karolos Nikitaras, Konstantinos Klapsas, Nikolaos Ellinas, et al.
Abstract
This paper proposes an Expressive Speech Synthesis model that utilizes token-level latent prosodic variables in order to capture and control utterance-level attributes, such as character acting voice and speaking style. Current works aim to explicitly factorize such fine-grained and utterance-level speech attributes into different representations extracted by modules that operate at the corresponding level. We show that the fine-grained latent space also captures coarse-grained information, which becomes more evident as the dimension of the latent space increases in order to capture diverse prosodic representations. Therefore, a trade-off arises between the diversity of the token-level and utterance-level representations and their disentanglement. We alleviate this issue by first capturing rich speech attributes into a token-level latent space and then separately training a prior network that, given the input text, learns utterance-level representations in order to predict the phoneme-level, poste
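The data flow described in the abstract can be illustrated with a toy sketch. This is not the authors' architecture: the class `ToyPriorNetwork`, the mean-pooling step, the single dense layers, and all dimensions are hypothetical choices made only to show how one utterance-level vector derived from the text can condition the prediction of one latent vector per phoneme. The target latents are assumed to come from a separately trained token-level encoder, which is stubbed out with random numbers here.

```python
import random


def mean_pool(vectors):
    """Average a list of equal-length vectors into a single vector."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def init_linear(rng, n_in, n_out):
    """Return (weights, bias) for a dense layer with small random weights."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(weights, bias, x):
    """Apply a dense layer; weights holds one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def mse(pred, target):
    """Mean squared error between two lists of equal-length vectors."""
    diffs = [(p - t) ** 2 for pr, tr in zip(pred, target) for p, t in zip(pr, tr)]
    return sum(diffs) / len(diffs)


class ToyPriorNetwork:
    """Hypothetical prior: text features -> utterance vector -> per-phoneme latents."""

    def __init__(self, text_dim, utt_dim, latent_dim, seed=0):
        rng = random.Random(seed)
        self.utt_proj = init_linear(rng, text_dim, utt_dim)
        self.latent_proj = init_linear(rng, text_dim + utt_dim, latent_dim)

    def utterance_embedding(self, phoneme_feats):
        # Assumption: the utterance-level representation is a projection of
        # the mean-pooled phoneme-level text features.
        return linear(*self.utt_proj, mean_pool(phoneme_feats))

    def predict_latents(self, phoneme_feats, utt_embedding=None):
        # Passing a different utt_embedding stands in for utterance-level control.
        if utt_embedding is None:
            utt_embedding = self.utterance_embedding(phoneme_feats)
        return [linear(*self.latent_proj, feat + utt_embedding) for feat in phoneme_feats]


# Usage: 5 phonemes with 4-dimensional text features, 3-dimensional latents.
rng = random.Random(1)
text_feats = [[rng.random() for _ in range(4)] for _ in range(5)]
# Stand-in for latents produced by a separately trained token-level encoder.
target_latents = [[rng.random() for _ in range(3)] for _ in range(5)]

prior = ToyPriorNetwork(text_dim=4, utt_dim=2, latent_dim=3)
predicted = prior.predict_latents(text_feats)
loss = mse(predicted, target_latents)  # the quantity a training loop would minimize
```

In this sketch, swapping the `utt_embedding` argument while keeping the text fixed changes every predicted phoneme-level latent at once, which is the sense in which a single utterance-level vector could steer token-level prosody.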
Related papers
- Uncovering Latent Style Factors For Expressive Speech Synthesis (2017) 0.00
- Visualization And Interpretation Of Latent Spaces For Controlling Expressive Speech Synthesis Through Audio Analysis (2019) 10.07
- Hierarchical Multi-grained Generative Model For Expressive Speech Synthesis (2020) 8.60
- Fully-hierarchical Fine-grained Prosody Modeling For Interpretable Speech Synthesis (2020) 13.28
- Controllable Speech Synthesis By Learning Discrete Phoneme-level Prosodic Representations (2022) 6.34
- Perception Of Prosodic Variation For Speech Synthesis Using An Unsupervised Discrete Representation Of F0 (2020) 7.81
- Predicting Phoneme-level Prosody Latents Using AR And Flow-based Prior Networks For Expressive Speech Synthesis (2022) 0.00
- Towards Expressive Zero-shot Speech Synthesis With Hierarchical Prosody Modeling (2024) 4.52