VARA-TTS: Non-autoregressive Text-to-speech Synthesis Based On Very Deep VAE With Residual Attention
2021 · Peng Liu, Yuewen Cao, Songxiang Liu, et al.
Abstract
This paper proposes VARA-TTS, a non-autoregressive (non-AR) text-to-speech (TTS) model that uses a very deep variational autoencoder (VDVAE) with a residual attention mechanism, which refines the textual-to-acoustic alignment layer by layer. Hierarchical latent variables with different temporal resolutions from the VDVAE are used as queries for the residual attention module. By taking the coarse global alignment from the previous attention layer as an extra input, each subsequent attention layer can produce a refined version of the alignment. This amortizes the burden of learning the textual-to-acoustic alignment across multiple attention layers and is more robust than using a single attention layer. An utterance-level speaking speed factor is computed by a jointly trained speaking speed predictor, which takes the mean-pooled latent variables of the coarsest layer as input, to determine the number of acoustic frames at inference. Experimental results show that VARA-TTS achieves slightly infer
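The layer-wise refinement described in the abstract can be illustrated with a small sketch. The abstract does not give the exact formulation, so everything below is an assumption made for illustration: the previous alignment is upsampled by simple repetition and added to the attention logits as a log-space bias, and the speaking speed factor is treated as frames per text token. The names `upsample_alignment`, `residual_attention`, and `num_frames` are hypothetical and do not come from the paper.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def upsample_alignment(align, factor):
    """Repeat each query row `factor` times (assumed nearest-neighbour upsampling)."""
    return [list(row) for row in align for _ in range(factor)]


def residual_attention(queries, keys, prev_align=None, eps=1e-8):
    """Scaled dot-product attention over text keys.

    If `prev_align` is given, its log is added to the logits so the
    coarser alignment acts as a prior (an assumed way to use it as
    an "extra input"; the paper may do this differently).
    """
    dim = len(keys[0])
    alignment = []
    for i, q in enumerate(queries):
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        if prev_align is not None:
            logits = [l + math.log(p + eps) for l, p in zip(logits, prev_align[i])]
        alignment.append(softmax(logits))
    return alignment


def num_frames(num_tokens, speed_factor):
    """Assumed mapping: speed factor = acoustic frames per text token."""
    return max(1, round(num_tokens * speed_factor))


# Toy example: 3 text tokens, a coarse layer with 2 steps, a finer layer with 4.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
coarse_queries = [[1.0, 0.0], [0.0, 1.0]]
fine_queries = [[1.0, 0.0], [1.0, 0.5], [0.5, 1.0], [0.0, 1.0]]

coarse = residual_attention(coarse_queries, keys)
fine = residual_attention(fine_queries, keys, upsample_alignment(coarse, 2))
frames = num_frames(num_tokens=10, speed_factor=5.2)
```

In this sketch the finer layer only has to correct the upsampled coarse alignment instead of learning the full alignment from scratch, which is the intuition the abstract gives for spreading the alignment burden across layers.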
Related papers
- VAENAR-TTS: Variational Auto-encoder Based Non-autoregressive Text-to-speech Synthesis (2021)
- Text-to-speech Synthesis Based On Latent Variable Conversion Using Diffusion Probabilistic Model And Variational Autoencoder (2022)
- Generating Diverse And Natural Text-to-speech Samples Using A Quantized Fine-grained VAE And Auto-regressive Prosody Prior (2020)
- Conditional Variational Autoencoder With Adversarial Learning For End-to-end Text-to-speech (2021)
- Continuous Autoregressive Modeling With Stochastic Monotonic Alignment For Speech Synthesis (2025)
- End-to-end Text-to-speech Using Latent Duration Based On VQ-VAE (2020)
- Unsupervised TTS Acoustic Modeling For TTS With Conditional Disentangled Sequential VAE (2022)
- Parallel Tacotron: Non-autoregressive And Controllable TTS (2020)