Lina-speech: Gated Linear Attention And Initial-state Tuning For Multi-sample Prompting Text-to-speech Synthesis
2024 · Théodor Lemerle, Téo Guichoux, Axel Roebel, et al.
Abstract
Neural codec language models, built on transformer architecture, have revolutionized text-to-speech (TTS) synthesis, excelling in voice cloning by treating it as a prefix continuation task. However, their limited context length hinders their effectiveness to short speech samples. As a result, the voice cloning ability is restricted to a limited coverage and diversity of the speaker's prosody and style. Besides, adapting prosody, accent, or appropriate emotion from a short prefix remains a challenging task. Finally, the quadratic complexity of self-attention limits inference throughput. In this work, we introduce Lina-Speech, a TTS model with Gated Linear Attention (GLA) to replace standard self-attention as a principled backbone, improving inference throughput while matching state-of-the-art performance. Leveraging the stateful property of recurrent architecture, we introduce an Initial-State Tuning (IST) strategy that unlocks the possibility of multiple speech sample conditioning of a
Authors
(none)
Tags
Stats
Related papers
- Small-e: Small Language Model With Linear Attention For Efficient Speech Synthesis (2024)9.02
- Clam-tts: Improving Neural Codec Language Model For Zero-shot Text-to-speech (2024)0.00
- Improving Language Model-based Zero-shot Text-to-speech Synthesis With Multi-scale Acoustic Prompts (2023)3.58
- Tacolm: Gated Attention Equipped Codec Language Model Are Efficient Zero-shot Text To Speech Synthesizers (2024)0.00
- RALL-E: Robust Codec Language Modeling With Chain-of-thought Prompting For Text-to-speech Synthesis (2024)0.00
- HD-PPT: Hierarchical Decoding Of Content- And Prompt-preference Tokens For Instruction-based TTS (2025)0.00
- Improving Robustness Of Llm-based Speech Synthesis By Learning Monotonic Alignment (2024)0.00
- High Quality, Lightweight And Adaptable TTS Using Lpcnet (2019)10.97