Semi-supervised Generative Modeling For Controllable Speech Synthesis
2019 · Raza Habib, Soroosh Mariooryad, Matt Shannon, et al.
Abstract
We present a novel generative model that combines state-of-the-art neural text-to-speech (TTS) with semi-supervised probabilistic latent variable models. By providing partial supervision to some of the latent variables, we are able to force them to take on consistent and interpretable purposes, which previously hasn't been possible with purely unsupervised TTS models. We demonstrate that our model is able to reliably discover and control important but rarely labelled attributes of speech, such as affect and speaking rate, with as little as 1% (30 minutes) supervision. Even at such low supervision levels we do not observe a degradation of synthesis quality compared to a state-of-the-art baseline. Audio samples are available on the web.
Related papers
- Hierarchical Generative Modeling For Controllable Speech Synthesis (2018)
- Controllable Prosody Generation With Partial Inputs (2023)
- Semi-supervised Learning For Continuous Emotional Intensity Controllable Speech Synthesis With Disentangled Representations (2022)
- Hierarchical Multi-grained Generative Model For Expressive Speech Synthesis (2020)
- Spontaneous Style Text-to-speech Synthesis With Controllable Spontaneous Behaviors Based On Language Models (2024)
- Semi-supervised Learning For Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation (2020)
- Continuous Autoregressive Modeling With Stochastic Monotonic Alignment For Speech Synthesis (2025)
- Towards Spontaneous Style Modeling With Semi-supervised Pre-training For Conversational Text-to-speech Synthesis (2023)