Spark-tts: An Efficient Llm-based Text-to-speech Model With Single-stream Decoupled Speech Tokens
2025 Β· Xinsheng Wang, Mingqi Jiang, Ziyang Ma, et al.
Abstract
Recent advancements in large language models (LLMs) have driven significant progress in zero-shot text-to-speech (TTS) synthesis. However, existing foundation models rely on multi-stage processing or complex architectures for predicting multiple codebooks, limiting efficiency and integration flexibility. To overcome these challenges, we introduce Spark-TTS, a novel system powered by BiCodec, a single-stream speech codec that decomposes speech into two complementary token types: low-bitrate semantic tokens for linguistic content and fixed-length global tokens for speaker attributes. This disentangled representation, combined with the Qwen2.5 LLM and a chain-of-thought (CoT) generation approach, enables both coarse-grained control (e.g., gender, speaking style) and fine-grained adjustments (e.g., precise pitch values, speaking rate). To facilitate research in controllable TTS, we introduce VoxBox, a meticulously curated 100,000-hour dataset with comprehensive attribute annotations. Exten
Authors
(none)
Tags
Stats
Related papers
- Clam-tts: Improving Neural Codec Language Model For Zero-shot Text-to-speech (2024)0.00
- Improving Audio Codec-based Zero-shot Text-to-speech Synthesis With Multi-modal Context And Large Language Model (2024)2.26
- Tacolm: Gated Attention Equipped Codec Language Model Are Efficient Zero-shot Text To Speech Synthesizers (2024)0.00
- Cosyvoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer Based On Supervised Semantic Tokens (2024)0.00
- Socodec: A Semantic-ordered Multi-stream Speech Codec For Efficient Language Model Based Text-to-speech Synthesis (2024)6.34
- Livespeech: Low-latency Zero-shot Text-to-speech Via Autoregressive Modeling Of Audio Discrete Codes (2024)5.84
- Disco-speech: Controllable Zero-shot Speech Generation With A Disentangled Speech Codec (2025)0.00
- Single-codec: Single-codebook Speech Codec Towards High-performance Speech Generation (2024)9.23