Llasa: Scaling Train-time And Inference-time Compute For Llama-based Speech Synthesis
2025 · Zhen Ye, Xinfa Zhu, Chi-Min Chan, et al.
Abstract
Recent advances in text-based large language models (LLMs), particularly the GPT series and the o1 model, have demonstrated the effectiveness of scaling both train-time and inference-time compute. However, current state-of-the-art TTS systems that leverage LLMs are often multi-stage and require separate models (e.g., a diffusion model after the LLM), which complicates the decision of whether to scale a particular model during training or testing. This work makes the following contributions. First, we explore the scaling of train-time and inference-time compute for speech synthesis. Second, we propose Llasa, a simple framework for speech synthesis that employs a single-layer vector quantizer (VQ) codec and a single Transformer architecture to fully align with standard LLMs such as Llama. Our experiments reveal that scaling train-time compute for Llasa consistently improves the naturalness of synthesized speech and enables the generation of more complex and accurate prosody patterns. Furthermore,
Related papers
- Get Large Language Models Ready To Speak: A Late-fusion Approach For Speech Generation (2024)
- Boosting Large Language Model For Speech Synthesis: An Empirical Study (2023)
- AudioChatLlama: Towards General-purpose Speech Abilities For LLMs (2023)
- Llama-Mimi: Exploring The Limits Of Flattened Speech Language Modeling (2025)
- LLaST: Improved End-to-end Speech Translation System Leveraged By Large Language Models (2024)
- LLaMA-Omni: Seamless Speech Interaction With Large Language Models (2024)
- Transducer-Llama: Integrating LLMs Into Streamable Transducer-based Speech Recognition (2024)
- Spark-TTS: An Efficient LLM-based Text-to-speech Model With Single-stream Decoupled Speech Tokens (2025)