Fast And High-quality Auto-regressive Speech Synthesis Via Speculative Decoding
2024 · Bohan Li, Hankun Wang, Situo Zhang, et al.
Abstract
The auto-regressive architecture, like GPTs, is widely used in modern Text-to-Speech (TTS) systems. However, it incurs substantial inference time, particularly due to the challenges in the next-token prediction posed by lengthy sequences of speech tokens. In this work, we introduce VADUSA, one of the first approaches to accelerate auto-regressive TTS through speculative decoding. Our results show that VADUSA not only significantly improves inference speed but also enhances performance by incorporating draft heads to predict future speech content auto-regressively. Furthermore, the inclusion of a tolerance mechanism during sampling accelerates inference without compromising quality. Our approach demonstrates strong generalization across large datasets and various types of speech tokens.
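As a rough illustration of the draft-and-verify idea the abstract refers to, the sketch below shows one generic speculative decoding step with a tolerance parameter. It is not VADUSA's algorithm: the acceptance rule (a draft token counts as verified if it is among the target model's top-k candidates), the `tolerance` count of unverified tokens that are accepted anyway, and the names `speculative_step`, `target_topk`, and `toy_target_topk` are all assumptions made for illustration. A real system would score every draft position in a single forward pass of the target model; the sequential calls here are only for readability.

```python
from typing import Callable, List, Sequence


def speculative_step(
    prefix: List[int],
    draft_tokens: Sequence[int],
    target_topk: Callable[[List[int], int], List[int]],
    k: int = 1,
    tolerance: int = 0,
) -> List[int]:
    """Return the tokens emitted by one draft-and-verify step.

    target_topk(context, k) is a hypothetical callable that returns the
    target model's k most likely next tokens, best first.
    Assumed rule: a draft token is verified if it is in that top-k list;
    up to `tolerance` unverified draft tokens are accepted anyway.
    """
    accepted: List[int] = []
    misses = 0
    for tok in draft_tokens:
        candidates = target_topk(prefix + accepted, k)
        if tok not in candidates:
            if misses >= tolerance:
                # Tolerance budget exhausted: emit the target's own best
                # token instead of the draft token and stop this step.
                accepted.append(candidates[0])
                return accepted
            misses += 1
        accepted.append(tok)
    # Every draft token was kept; the target contributes one extra token.
    accepted.append(target_topk(prefix + accepted, k)[0])
    return accepted


def toy_target_topk(context: List[int], k: int) -> List[int]:
    """Toy stand-in for a target model: prefers last token + 1 (mod 10)."""
    last = context[-1]
    return [(last + 1 + i) % 10 for i in range(k)]


strict = speculative_step([0], [1, 2, 5, 6], toy_target_topk, k=1, tolerance=0)
lenient = speculative_step([0], [1, 2, 5, 6], toy_target_topk, k=1, tolerance=1)
print(strict)   # [1, 2, 3]
print(lenient)  # [1, 2, 5, 6, 7]
```

With `tolerance=0` the step stops at the first mismatch and emits three tokens; with `tolerance=1` the mismatching draft token is kept and the step emits five, which is the kind of speed-for-strictness trade the abstract attributes to its tolerance mechanism.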
Related papers
- Accelerating Codec-based Speech Synthesis With Multi-token Prediction And Speculative Decoding (2024)
- Principled Coarse-grained Acceptance For Speculative Decoding In Speech (2025)
- Generating Diverse And Natural Text-to-speech Samples Using A Quantized Fine-grained VAE And Auto-regressive Prosody Prior (2020)
- Expediting TTS Synthesis With Adversarial Vocoding (2019)
- Parallel Synthesis For Autoregressive Speech Generation (2022)
- Conditional Variational Autoencoder With Adversarial Learning For End-to-end Text-to-speech (2021)
- LiveSpeech: Low-latency Zero-shot Text-to-speech Via Autoregressive Modeling Of Audio Discrete Codes (2024)
- VARA-TTS: Non-autoregressive Text-to-speech Synthesis Based On Very Deep VAE With Residual Attention (2021)