Expressive Speech Synthesis Via Modeling Expressions With Variational Autoencoder
2018 · Kei Akuzawa, Yusuke Iwasawa, Yutaka Matsuo
Abstract
Recent advances in neural autoregressive models have improved the performance of speech synthesis (SS). However, because these models lack the ability to model global characteristics of speech (such as speaker individualities or speaking styles), particularly when these characteristics have not been labeled, making neural autoregressive SS systems more expressive is still an open issue. In this paper, we propose to combine VoiceLoop, an autoregressive SS model, with a Variational Autoencoder (VAE). Unlike traditional autoregressive SS systems, this approach uses the VAE to model the global characteristics explicitly, enabling the expressiveness of the synthesized speech to be controlled in an unsupervised manner. Experiments using the VCTK and Blizzard2012 datasets show that the VAE helps VoiceLoop to generate higher-quality speech and to control the expressions in its synthesized speech by incorporating global characteristics into the speech generation process.
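As a rough illustration of the VAE component mentioned in the abstract, the sketch below shows the two generic ingredients of a Gaussian VAE latent: sampling with the reparameterization trick and the KL penalty against a standard normal prior. It is not the authors' implementation. The function names, the two-dimensional latent, and the numeric values are all hypothetical, and the abstract does not say how the latent is fed into VoiceLoop.

```python
import math
import random


def sample_latent(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over latent dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))


# Hypothetical encoder output for one utterance (values are made up).
mu = [0.5, -1.0]
log_var = [0.0, -2.0]

rng = random.Random(0)
z = sample_latent(mu, log_var, rng)      # one global latent vector per utterance
kl = kl_to_standard_normal(mu, log_var)  # regularizer added to the reconstruction loss
```

In a setup like this, a single vector `z` per utterance would stand in for global characteristics such as speaker or style, and changing `z` at synthesis time is what would make the expression controllable.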
Related papers
- Learning Latent Representations For Style Control And Transfer In End-to-end Speech Synthesis (2018)
- Deep Encoder-decoder Models For Unsupervised Learning Of Controllable Speech Synthesis (2018)
- Continuous Autoregressive Modeling With Stochastic Monotonic Alignment For Speech Synthesis (2025)
- Conditional Deep Hierarchical Variational Autoencoder For Voice Conversion (2021)
- Cross-utterance Conditioned VAE For Speech Generation (2023)
- Accented Text-to-speech Synthesis With A Conditional Variational Autoencoder (2022)
- Using Vaes And Normalizing Flows For One-shot Text-to-speech Synthesis Of Expressive Speech (2019)
- A Benchmark Of Dynamical Variational Autoencoders Applied To Speech Spectrogram Modeling (2021)