Using VAEs And Normalizing Flows For One-shot Text-to-speech Synthesis Of Expressive Speech
2019 · Vatsal Aggarwal, Marius Cotescu, Nishant Prateek, et al.
Abstract
We propose a Text-to-Speech method that creates an unseen expressive style from a single utterance of expressive speech of around one second. Specifically, we enhance the disentanglement capabilities of a state-of-the-art sequence-to-sequence system with a Variational AutoEncoder (VAE) and a Householder Flow. The proposed system provides a 22% reduction in KL divergence while also improving perceptual metrics over the state of the art. At synthesis time, we use one example of the expressive style as a reference input to the encoder to generate any text in the desired style. Perceptual MUSHRA evaluations show that we can create a voice with a 9% relative naturalness improvement over standard Neural Text-to-Speech, while also improving the perceived emotional intensity (a score of 59, compared to 55 for neutral speech).
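As a rough illustration of the Householder Flow named in the abstract, the sketch below samples a latent from a diagonal-Gaussian posterior (the usual VAE reparameterization) and passes it through a short chain of Householder reflections, z' = z - 2 v (v·z) / (v·v). Each reflection is an orthogonal map, so it preserves volume and adds nothing to the log-determinant term of the objective. This is not the authors' implementation: the latent dimension, the values of `mu` and `log_var`, and the fixed reflection vectors in `vectors` are hypothetical placeholders. In a trained system these quantities would come from the reference encoder.

```python
import math
import random


def sample_diagonal_gaussian(mu, log_var, rng):
    """Reparameterized sample z0 = mu + sigma * eps, with eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def householder_step(z, v):
    """Reflect z across the hyperplane orthogonal to v."""
    vv = sum(x * x for x in v)
    if vv == 0.0:
        raise ValueError("Householder vector must be non-zero")
    scale = 2.0 * sum(a * b for a, b in zip(v, z)) / vv
    return [zi - scale * vi for zi, vi in zip(z, v)]


def householder_flow(z0, vectors):
    """Apply one reflection per vector, in order."""
    z = list(z0)
    for v in vectors:
        z = householder_step(z, v)
    return z


def norm(x):
    return math.sqrt(sum(t * t for t in x))


# Hypothetical posterior parameters and reflection vectors (assumptions,
# standing in for encoder outputs).
rng = random.Random(0)
mu = [0.1, -0.3, 0.5, 0.0]
log_var = [-1.0, -0.5, -2.0, -1.5]
vectors = [[1.0, 0.5, -0.2, 0.3], [0.0, 1.0, 1.0, -1.0]]

z0 = sample_diagonal_gaussian(mu, log_var, rng)
zK = householder_flow(z0, vectors)

# Reflections are orthogonal, so the vector norm is unchanged by the flow.
print(abs(norm(z0) - norm(zK)) < 1e-9)
```

Because the flow only rotates and reflects the sample, its benefit comes from letting the posterior have a non-diagonal covariance rather than from reshaping the density; the KL term is still evaluated on the transformed latent.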
Related papers
- Improving Multi-speaker TTS Prosody Variance With A Residual Encoder And Normalizing Flows (2021)
- Expressive Speech Synthesis Via Modeling Expressions With Variational Autoencoder (2018)
- Learning Latent Representations For Style Control And Transfer In End-to-end Speech Synthesis (2018)
- StyleSpeech: Self-supervised Style Enhancing With VQ-VAE-based Pre-training For Expressive Audiobook Speech Synthesis (2023)
- Style-label-free: Cross-speaker Style Transfer By Quantized VAE And Speaker-wise Normalization In Speech Synthesis (2022)
- Conditional Variational Autoencoder With Adversarial Learning For End-to-end Text-to-speech (2021)
- Cross-utterance Conditioned VAE For Speech Generation (2023)
- Low-resource Expressive Text-to-speech Using Data Augmentation (2020)