Hierarchical And Multi-scale Variational Autoencoder For Diverse And Natural Non-autoregressive Text-to-speech
2022 · Jae-Sung Bae, Jinhyeok Yang, Tae-Jun Bak, et al.
Abstract
This paper proposes a hierarchical and multi-scale variational autoencoder-based non-autoregressive text-to-speech model (HiMuV-TTS) to generate natural speech with diverse speaking styles. Recent advances in non-autoregressive TTS (NAR-TTS) models have significantly improved inference speed and the robustness of synthesized speech. However, the diversity of speaking styles and the naturalness of the speech still need improvement. To address this, the HiMuV-TTS model first determines the global-scale prosody and then determines the local-scale prosody, conditioning on the global-scale prosody and the learned text representation. In addition, we improve speech quality by adopting adversarial training. Experimental results verify that the proposed HiMuV-TTS model generates more diverse and natural speech than TTS models with single-scale variational autoencoders, and that it represents different prosody information at each scale.
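The two-stage ordering described in the abstract (global-scale prosody first, then local-scale prosody conditioned on it and on the text representation) can be illustrated with a toy sampling sketch. This is not the authors' implementation: the standard normal prior on the global latent, the single random linear projection standing in for a learned network, the latent dimensions, and the names `sample_gaussian` and `hierarchical_prosody_sample` are all assumptions made for illustration.

```python
import numpy as np


def sample_gaussian(mu, log_var, rng):
    """Reparameterized Gaussian sample: z = mu + sigma * eps."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps


def hierarchical_prosody_sample(text_repr, rng, global_dim=4, local_dim=2):
    """Toy two-level sampling: one global latent, then per-token local latents.

    text_repr: array of shape (num_tokens, hidden_dim), a stand-in for the
    learned text representation mentioned in the abstract.
    """
    num_tokens, hidden_dim = text_repr.shape

    # Hypothetical random projection standing in for a learned prior network.
    w_local = 0.1 * rng.standard_normal((hidden_dim + global_dim, 2 * local_dim))

    # 1) Global-scale prosody: one latent per utterance.
    #    Assumption: drawn from a standard normal prior.
    z_global = sample_gaussian(np.zeros(global_dim), np.zeros(global_dim), rng)

    # 2) Local-scale prosody: the prior parameters depend on both the text
    #    representation and the global latent (broadcast to every token).
    cond = np.concatenate(
        [text_repr, np.tile(z_global, (num_tokens, 1))], axis=1
    )
    stats = cond @ w_local
    mu_local, log_var_local = stats[:, :local_dim], stats[:, local_dim:]
    z_local = sample_gaussian(mu_local, log_var_local, rng)

    return z_global, z_local


rng = np.random.default_rng(0)
text_repr = rng.standard_normal((5, 8))  # 5 tokens, hidden size 8 (arbitrary)
z_global, z_local = hierarchical_prosody_sample(text_repr, rng)
print(z_global.shape, z_local.shape)  # (4,) (5, 2)
```

In this sketch, resampling `z_global` while holding the random seed for the local stage fixed would change every token's local prior at once, which is one way to picture how a global-scale latent could shape utterance-level style while local latents vary prosody token by token.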
Related papers
- Conditional Variational Autoencoder With Adversarial Learning For End-to-end Text-to-speech (2021)
- Hierarchical Multi-grained Generative Model For Expressive Speech Synthesis (2020)
- Hierarchical Prosody Modeling And Control In Non-autoregressive Parallel Neural TTS (2021)
- Hierarchical Generative Modeling For Controllable Speech Synthesis (2018)
- HiGNN-TTS: Hierarchical Prosody Modeling With Graph Neural Networks For Expressive Long-form TTS (2023)
- Generating Diverse And Natural Text-to-speech Samples Using A Quantized Fine-grained VAE And Auto-regressive Prosody Prior (2020)
- MSStyleTTS: Multi-scale Style Modeling With Hierarchical Context Information For Expressive Speech Synthesis (2023)
- Using Generative Modelling To Produce Varied Intonation For Speech Synthesis (2019)