Cross-utterance Conditioned VAE For Speech Generation
2023 · Yang Li, Cheng Yu, Guangzhi Sun, et al.
Abstract
Speech synthesis systems powered by neural networks hold promise for multimedia production, but frequently face issues with producing expressive speech and seamless editing. In response, we present the Cross-Utterance Conditioned Variational Autoencoder speech synthesis (CUC-VAE S2) framework to enhance prosody and ensure natural speech generation. This framework leverages the powerful representational capabilities of pre-trained language models and the re-expression abilities of variational autoencoders (VAEs). The core component of the CUC-VAE S2 framework is the cross-utterance CVAE, which extracts acoustic, speaker, and textual features from surrounding sentences to generate context-sensitive prosodic features, more accurately emulating human prosody generation. We further propose two practical algorithms tailored for distinct speech synthesis applications: CUC-VAE TTS for text-to-speech and CUC-VAE SE for speech editing. The CUC-VAE TTS is a direct application of the framework, de
Related papers
- Audio-visual Speech Enhancement Using Conditional Variational Auto-encoders (2019)
- Expressive Speech Synthesis Via Modeling Expressions With Variational Autoencoder (2018)
- Using VAEs And Normalizing Flows For One-shot Text-to-speech Synthesis Of Expressive Speech (2019)
- CHiVE: Varying Prosody In Speech Synthesis With A Linguistically Driven Dynamic Hierarchical Conditional Variational Network (2019)
- Accented Text-to-speech Synthesis With A Conditional Variational Autoencoder (2022)
- UniSyn: An End-to-end Unified Model For Text-to-speech And Singing Voice Synthesis (2022)
- CSSinger: End-to-end Chunkwise Streaming Singing Voice Synthesis System Based On Conditional Variational Autoencoder (2024)
- Unsupervised TTS Acoustic Modeling For TTS With Conditional Disentangled Sequential VAE (2022)