End-to-end Text-to-speech Using Latent Duration Based On VQ-VAE
2020 · Yusuke Yasuda, Xin Wang, Junichi Yamagishi
Abstract
Explicit duration modeling is key to achieving robust and efficient alignment in text-to-speech synthesis (TTS). We propose a new TTS framework with explicit duration modeling that incorporates duration as a discrete latent variable and enables joint optimization of all modules from scratch. We formulate our method on the basis of a conditional VQ-VAE to handle discrete duration in a variational autoencoder, and we provide a theoretical explanation to justify it. In our framework, a connectionist temporal classification (CTC)-based force aligner acts as the approximate posterior, and a text-to-duration model serves as the prior in the variational autoencoder. We evaluated the proposed method with a listening test and compared it with other TTS methods based on soft attention or explicit duration modeling. The results showed that our systems were rated between the soft-attention-based methods (Transformer-TTS, Tacotron2) and the explicit-duration-modeling-based method (FastSpeech).
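As a rough illustration of the roles the abstract assigns to the aligner and the text-to-duration model, the generic evidence lower bound of a conditional variational autoencoder with a discrete latent can be written as below. This is a sketch of the standard bound, not necessarily the exact objective used in the paper. The symbols are assumptions introduced here: x is the input text, y is the target speech, d is the discrete duration sequence, q(d | x, y) stands for the approximate posterior (the CTC-based aligner), and p(d | x) stands for the prior (the text-to-duration model).

```latex
% Generic conditional-VAE lower bound with a discrete latent d (illustrative only).
% x: text, y: speech, d: duration sequence (all symbols are assumptions).
\log p(y \mid x)
  \;\ge\;
  \mathbb{E}_{q(d \mid x, y)}\!\left[ \log p(y \mid x, d) \right]
  \;-\;
  D_{\mathrm{KL}}\!\left( q(d \mid x, y) \,\|\, p(d \mid x) \right)
```

Because d is discrete, the expectation is a sum over duration sequences rather than an integral. The first term rewards reconstructing speech from text and durations, and the second pulls the prior and the posterior over durations toward each other.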
Related papers
- Text-to-speech Synthesis Based On Latent Variable Conversion Using Diffusion Probabilistic Model And Variational Autoencoder (2022)
- AutoTTS: End-to-end Text-to-speech Synthesis Through Differentiable Duration Modeling (2022)
- Aligner-guided Training Paradigm: Advancing Text-to-speech Models With Aligner Guided Duration (2024)
- Generating Diverse And Natural Text-to-speech Samples Using A Quantized Fine-grained VAE And Auto-regressive Prosody Prior (2020)
- Conditional Variational Autoencoder With Adversarial Learning For End-to-end Text-to-speech (2021)
- VARA-TTS: Non-autoregressive Text-to-speech Synthesis Based On Very Deep VAE With Residual Attention (2021)
- Unsupervised TTS Acoustic Modeling For TTS With Conditional Disentangled Sequential VAE (2022)
- Continuous Autoregressive Modeling With Stochastic Monotonic Alignment For Speech Synthesis (2025)