Unsupervised TTS Acoustic Modeling For TTS With Conditional Disentangled Sequential VAE
2022 · Jiachen Lian, Chunlei Zhang, Gopala Krishna Anumanchipalli, et al.
Abstract
In this paper, we propose a novel unsupervised text-to-speech acoustic model training scheme, named UTTS, which does not require text-audio pairs. UTTS is a multi-speaker speech synthesizer that supports zero-shot voice cloning and is developed from the perspective of disentangled speech representation learning. The framework offers a flexible choice of a speaker's duration model, timbre feature (identity), and content for TTS inference. We leverage recent advancements in self-supervised speech representation learning as well as speech synthesis front-end techniques for system development. Specifically, we employ our recently formulated Conditional Disentangled Sequential Variational Auto-encoder (C-DSVAE) as the backbone UTTS AM, which offers well-structured content representations given unsupervised alignment (UA) as the condition during training. For UTTS inference, we utilize a lexicon to map input text to the phoneme sequence, which is expanded to the frame-level forced alignment (FA) wi
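The inference front end described in the abstract (lexicon lookup, then expansion of the phoneme sequence to a frame-level alignment) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy `LEXICON`, the function names, and the per-phoneme frame counts are all hypothetical, and the durations are simply passed in rather than predicted by any duration model.

```python
from typing import Dict, List

# Hypothetical toy lexicon; a real system would load a full pronunciation dictionary.
LEXICON: Dict[str, List[str]] = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}


def text_to_phonemes(text: str, lexicon: Dict[str, List[str]]) -> List[str]:
    """Map whitespace-separated words to a flat phoneme sequence via the lexicon."""
    phonemes: List[str] = []
    for word in text.lower().split():
        if word not in lexicon:
            raise KeyError(f"word not in lexicon: {word!r}")
        phonemes.extend(lexicon[word])
    return phonemes


def expand_to_frames(phonemes: List[str], durations: List[int]) -> List[str]:
    """Repeat each phoneme for its duration (in frames) to get a frame-level alignment.

    The durations are assumed to be supplied by some external duration model.
    """
    if len(phonemes) != len(durations):
        raise ValueError("phonemes and durations must have the same length")
    frames: List[str] = []
    for phoneme, n_frames in zip(phonemes, durations):
        if n_frames < 0:
            raise ValueError("durations must be non-negative")
        frames.extend([phoneme] * n_frames)
    return frames


if __name__ == "__main__":
    phones = text_to_phonemes("hello world", LEXICON)
    # Made-up frame counts, one per phoneme, for illustration only.
    alignment = expand_to_frames(phones, [3, 5, 4, 6, 3, 5, 4, 2])
    print(len(alignment), alignment[:8])
```

Keeping the durations as an explicit input mirrors the abstract's point that the duration model is a separate, swappable choice at inference time.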
Related papers
- UniSyn: An End-to-end Unified Model For Text-to-speech And Singing Voice Synthesis (2022)
- Robust Disentangled Variational Speech Representation Learning For Zero-shot Voice Conversion (2022)
- nnSpeech: Speaker-guided Conditional Variational Autoencoder For Zero-shot Multi-speaker Text-to-speech (2022)
- Text-to-speech Synthesis Based On Latent Variable Conversion Using Diffusion Probabilistic Model And Variational Autoencoder (2022)
- End-to-end Text-to-speech Using Latent Duration Based On VQ-VAE (2020)
- Semi-supervised Learning For Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation (2020)
- VARA-TTS: Non-autoregressive Text-to-speech Synthesis Based On Very Deep VAE With Residual Attention (2021)
- Continuous Autoregressive Modeling With Stochastic Monotonic Alignment For Speech Synthesis (2025)