Tacolm: Gated Attention Equipped Codec Language Model Are Efficient Zero-shot Text To Speech Synthesizers
2024 Β· Yakun Song, Zhuo Chen, Xiaofei Wang, et al.
Abstract
Neural codec language model (LM) has demonstrated strong capability in zero-shot text-to-speech (TTS) synthesis. However, the codec LM often suffers from limitations in inference speed and stability, due to its auto-regressive nature and implicit alignment between text and audio. In this work, to handle these challenges, we introduce a new variant of neural codec LM, namely TacoLM. Specifically, TacoLM introduces a gated attention mechanism to improve the training and inference efficiency and reduce the model size. Meanwhile, an additional gated cross-attention layer is included for each decoder layer, which improves the efficiency and content accuracy of the synthesized speech. In the evaluation of the Librispeech corpus, the proposed TacoLM achieves a better word error rate, speaker similarity, and mean opinion score, with 90% fewer parameters and 5.2 times speed up, compared with VALL-E. Demo and code is available at https://ereboas.github.io/TacoLM/.
Authors
(none)
Tags
Stats
Related papers
- Clam-tts: Improving Neural Codec Language Model For Zero-shot Text-to-speech (2024)0.00
- VALL-E R: Robust And Efficient Zero-shot Text-to-speech Synthesis Via Monotonic Alignment (2024)0.00
- Fewer-token Neural Speech Codec With Time-invariant Codes (2023)10.07
- Spark-tts: An Efficient Llm-based Text-to-speech Model With Single-stream Decoupled Speech Tokens (2025)8.08
- Livespeech: Low-latency Zero-shot Text-to-speech Via Autoregressive Modeling Of Audio Discrete Codes (2024)5.84
- VALL-E 2: Neural Codec Language Models Are Human Parity Zero-shot Text To Speech Synthesizers (2024)0.00
- ELLA-V: Stable Neural Codec Language Modeling With Alignment-guided Sequence Reordering (2024)0.00
- HALL-E: Hierarchical Neural Codec Language Model For Minute-long Zero-shot Text-to-speech Synthesis (2024)0.00