LiveSpeech: Low-latency Zero-shot Text-to-speech Via Autoregressive Modeling Of Audio Discrete Codes
2024 · Trung Dang, David Aponte, Dung Tran, et al.
Abstract
Prior works have demonstrated zero-shot text-to-speech by using a generative language model on audio tokens obtained via a neural audio codec. It is still challenging, however, to adapt such models to low-latency scenarios. In this paper, we present LiveSpeech, a fully autoregressive language model-based approach for zero-shot text-to-speech that enables low-latency streaming of the output audio. To allow multiple token prediction within a single decoding step, we propose (1) using adaptive codebook loss weights that consider codebook contribution in each frame and focus on hard instances, and (2) grouping codebooks and processing groups in parallel. Experiments show that our proposed models achieve results competitive with state-of-the-art baselines in terms of content accuracy, speaker similarity, audio quality, and inference speed while being suitable for low-latency streaming applications.
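The two ideas named in the abstract can be illustrated with a small sketch. This is not the paper's formulation: the abstract gives no equations, so the contiguous grouping scheme, the focal-loss-style weighting, the `gamma` parameter, and all function names below are assumptions chosen only to make the concepts concrete.

```python
import math


def group_codebooks(num_codebooks, num_groups):
    """Split codebook indices into equal, contiguous groups.

    Assumption: groups are contiguous and equally sized; the abstract
    does not say how codebooks are assigned to groups.
    """
    if num_groups <= 0 or num_codebooks % num_groups != 0:
        raise ValueError("num_codebooks must be divisible by num_groups")
    size = num_codebooks // num_groups
    return [list(range(g * size, (g + 1) * size)) for g in range(num_groups)]


def focal_style_weights(correct_probs, gamma=2.0):
    """Per-frame weights over codebooks that emphasise hard tokens.

    correct_probs holds the probability the model assigned to the
    ground-truth token of each codebook in one frame. A focal-style
    factor (1 - p) ** gamma is a stand-in for the paper's adaptive
    weights, normalised so the weights in a frame sum to 1.
    """
    raw = [(1.0 - p) ** gamma for p in correct_probs]
    total = sum(raw)
    if total == 0.0:
        # Every token is predicted perfectly: fall back to uniform weights.
        return [1.0 / len(raw)] * len(raw)
    return [r / total for r in raw]


def weighted_frame_loss(correct_probs, gamma=2.0):
    """Weighted cross-entropy for one frame (probabilities must be > 0)."""
    weights = focal_style_weights(correct_probs, gamma)
    return sum(w * -math.log(p) for w, p in zip(weights, correct_probs))
```

With 8 codebooks and 4 groups, `group_codebooks(8, 4)` yields four pairs of indices, each of which could be predicted by a separate head in the same decoding step. In `focal_style_weights`, a codebook whose ground-truth token received low probability gets a larger share of that frame's loss.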
Related papers
- MobileSpeech: A Fast And High-fidelity Framework For Mobile Zero-shot Text-to-speech (2024) 0.00
- CLaM-TTS: Improving Neural Codec Language Model For Zero-shot Text-to-speech (2024) 0.00
- High Quality Streaming Speech Synthesis With Low, Sentence-length-independent Latency (2021) 8.60
- SpeechX: Neural Codec Language Model As A Versatile Speech Transformer (2023) 11.85
- Iterative Autoregression: A Novel Trick To Improve Your Low-latency Speech Enhancement Model (2022) 5.24
- TacoLM: Gated Attention Equipped Codec Language Model Are Efficient Zero-shot Text To Speech Synthesizers (2024) 0.00
- VALL-E R: Robust And Efficient Zero-shot Text-to-speech Synthesis Via Monotonic Alignment (2024) 0.00
- Spark-TTS: An Efficient LLM-based Text-to-speech Model With Single-stream Decoupled Speech Tokens (2025) 8.08