Hierarchical Generative Modeling For Controllable Speech Synthesis
2018 · Wei-Ning Hsu, Yu Zhang, Ron J. Weiss, et al.
Abstract
This paper proposes a neural sequence-to-sequence text-to-speech (TTS) model which can control latent attributes in the generated speech that are rarely annotated in the training data, such as speaking style, accent, background noise, and recording conditions. The model is formulated as a conditional generative model based on the variational autoencoder (VAE) framework, with two levels of hierarchical latent variables. The first level is a categorical variable, which represents attribute groups (e.g. clean/noisy) and provides interpretability. The second level, conditioned on the first, is a multivariate Gaussian variable, which characterizes specific attribute configurations (e.g. noise level, speaking rate) and enables disentangled fine-grained control over these attributes. This amounts to using a Gaussian mixture model (GMM) for the latent distribution. Extensive evaluation demonstrates its ability to control the aforementioned attributes. In particular, we train a high-quality con
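The two-level latent structure described in the abstract can be illustrated with a small NumPy sketch. This is not the paper's implementation: the function names, the diagonal-covariance assumption, and the toy mixture parameters are all hypothetical and chosen only for illustration. It samples a categorical attribute group y, then a Gaussian attribute vector z conditioned on y, and evaluates the resulting mixture prior log p(z) = log sum_k p(y=k) N(z; mu_k, diag(sigma_k^2)).

```python
import numpy as np


def _log_components(z, weights, means, log_stds):
    """Per-component log p(y=k) + log N(z; mu_k, diag(sigma_k^2)), shape (K,)."""
    var = np.exp(2.0 * log_stds)
    log_norm = -0.5 * (np.log(2.0 * np.pi) + 2.0 * log_stds + (z - means) ** 2 / var)
    return np.log(weights) + log_norm.sum(axis=1)


def sample_hierarchical_latent(rng, weights, means, log_stds):
    """Draw y ~ Categorical(weights), then z | y ~ N(means[y], diag(sigma_y^2))."""
    y = rng.choice(len(weights), p=weights)
    eps = rng.standard_normal(means.shape[1])
    z = means[y] + np.exp(log_stds[y]) * eps
    return y, z


def log_prior(z, weights, means, log_stds):
    """Log density of z under the Gaussian mixture prior (stable log-sum-exp)."""
    log_comp = _log_components(z, weights, means, log_stds)
    m = log_comp.max()
    return m + np.log(np.exp(log_comp - m).sum())


def component_posterior(z, weights, means, log_stds):
    """p(y=k | z): which attribute group most plausibly produced z."""
    log_comp = _log_components(z, weights, means, log_stds)
    return np.exp(log_comp - log_prior(z, weights, means, log_stds))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy prior (assumed values): two attribute groups in a 3-dimensional latent space.
    weights = np.array([0.5, 0.5])
    means = np.array([[-2.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
    log_stds = np.zeros((2, 3))
    y, z = sample_hierarchical_latent(rng, weights, means, log_stds)
    print("group:", y, "latent:", z)
    print("p(y | z):", component_posterior(z, weights, means, log_stds))
```

In this sketch, picking a component fixes the coarse attribute group (for example clean versus noisy), while moving z within that component corresponds to the fine-grained adjustment the abstract attributes to the second-level Gaussian variable.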
Related papers
- Hierarchical Multi-grained Generative Model For Expressive Speech Synthesis (2020) 8.60
- Semi-supervised Generative Modeling For Controllable Speech Synthesis (2019) 0.00
- Hierarchical Prosody Modeling And Control In Non-autoregressive Parallel Neural TTS (2021) 8.35
- Using Generative Modelling To Produce Varied Intonation For Speech Synthesis (2019) 7.81
- FoundationTTS: Text-to-speech For ASR Customization With Generative Language Model (2023) 0.00
- Fully-hierarchical Fine-grained Prosody Modeling For Interpretable Speech Synthesis (2020) 13.28
- Continuous Autoregressive Modeling With Stochastic Monotonic Alignment For Speech Synthesis (2025) 0.00
- Text-to-speech Synthesis Based On Latent Variable Conversion Using Diffusion Probabilistic Model And Variational Autoencoder (2022) 0.00