AutoTTS: End-to-end Text-to-speech Synthesis Through Differentiable Duration Modeling
2022 · Bac Nguyen, Fabien Cardinaux, Stefan Uhlich
Abstract
Parallel text-to-speech (TTS) models have recently enabled fast and highly natural speech synthesis. However, they typically require external alignment models, which are not necessarily optimized for the decoder as they are not jointly trained. In this paper, we propose a differentiable duration method for learning monotonic alignments between input and output sequences. Our method is based on a soft-duration mechanism that optimizes a stochastic process in expectation. Using this differentiable duration method, we introduce AutoTTS, a direct text-to-waveform speech synthesis model. AutoTTS enables high-fidelity speech synthesis through a combination of adversarial training and matching the total ground-truth duration. Experimental results show that our model obtains competitive results while enjoying a much simpler training pipeline. Audio samples are available online.
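To make the idea of a differentiable duration concrete, the following is a minimal sketch, not the soft-duration mechanism proposed in the paper (the abstract does not specify it). It uses a generic Gaussian-style soft upsampling as a stand-in: each token gets a real-valued duration, and every output frame is softly assigned to tokens by its distance to the token centers, so the alignment varies smoothly with the durations. The function names `soft_alignment` and `total_duration_loss`, the `sigma` parameter, and the example durations are all assumptions made for illustration.

```python
import math


def soft_alignment(durations, num_frames, sigma=1.0):
    """Return a num_frames x len(durations) matrix of soft alignment weights.

    Illustrative stand-in only: Gaussian-style soft upsampling, not the
    AutoTTS soft-duration mechanism.
    """
    # Center of each token on the output time axis:
    # cumulative end position minus half the token's duration.
    centers = []
    end = 0.0
    for d in durations:
        end += d
        centers.append(end - d / 2.0)

    weights = []
    for t in range(num_frames):
        pos = t + 0.5  # center of output frame t
        logits = [-((pos - c) ** 2) / (2.0 * sigma ** 2) for c in centers]
        m = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(l - m) for l in logits]
        z = sum(exps)
        weights.append([e / z for e in exps])  # softmax over tokens
    return weights


def total_duration_loss(durations, true_num_frames):
    """Absolute gap between the summed durations and the true frame count."""
    return abs(sum(durations) - true_num_frames)


# Hypothetical durations (in frames) for three input tokens.
durations = [2.0, 3.0, 1.0]
align = soft_alignment(durations, num_frames=6)
loss = total_duration_loss(durations, true_num_frames=6)
```

Because the token centers increase with the token index, the most likely token per frame never moves backwards, which is the monotonic property the abstract refers to. The second function mirrors, in the simplest possible form, the idea of matching the total ground-truth duration rather than per-token durations from an external aligner.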
Related papers
- Parallel Tacotron 2: A Non-autoregressive Neural TTS Model With Differentiable Duration Modeling (2021)
- Conditional Variational Autoencoder With Adversarial Learning For End-to-end Text-to-speech (2021)
- Continuous Autoregressive Modeling With Stochastic Monotonic Alignment For Speech Synthesis (2025)
- End-to-end Text-to-speech Using Latent Duration Based On VQ-VAE (2020)
- End-to-end Adversarial Text-to-speech (2020)
- Reinforce-Aligner: Reinforcement Alignment Search For Robust End-to-end Text-to-speech (2021)
- EfficientTTS: An Efficient And High-quality Text-to-speech Architecture (2020)
- Aligner-guided Training Paradigm: Advancing Text-to-speech Models With Aligner Guided Duration (2024)