DisCo-Speech: Controllable Zero-shot Speech Generation With A Disentangled Speech Codec
2025 · Tao Li, Wenshuo Ge, Zhichao Wang, et al.
Abstract
Codec-based language models (LMs) have revolutionized text-to-speech (TTS). However, standard codecs entangle timbre and prosody, which hinders independent control in continuation-based LMs. To tackle this challenge, we propose DisCo-Speech, a zero-shot controllable TTS framework featuring a disentangled speech codec (DisCodec) and an LM-based generator. The core component DisCodec employs a two-stage design: 1) tri-factor disentanglement to separate speech into content, prosody, and timbre subspaces via parallel encoders and hybrid losses; and 2) fusion and reconstruction that merges content and prosody into unified content-prosody tokens suitable for LM prediction, while jointly optimizing reconstruction to address the disentanglement-reconstruction trade-off. This allows the LM to perform prosodic continuation from a style prompt while the decoder injects target timbre, enabling flexible zero-shot control. Experiments demonstrate that DisCo-Speech achieves competitive voice cloning
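The control flow described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: every name here (`Speech`, `encode_content_prosody`, `toy_lm_continue`, `decode`, `synthesize`) is hypothetical, and simple integer arithmetic stands in for the neural encoders, the LM, and the decoder. It only shows the assumed data flow: the style prompt supplies fused content-prosody tokens for the LM to continue, while timbre is taken from a separate prompt and injected at decoding time.

```python
from dataclasses import dataclass
from typing import List

PROSODY_LEVELS = 4  # assumed toy vocabulary size for prosody ids


@dataclass(frozen=True)
class Speech:
    """Toy utterance: per-frame content and prosody ids, one global timbre id."""
    content: List[int]
    prosody: List[int]
    timbre: int


def encode_content_prosody(s: Speech) -> List[int]:
    """Stand-in for stage 2: fuse content and prosody into one token per frame."""
    return [c * PROSODY_LEVELS + p for c, p in zip(s.content, s.prosody)]


def toy_lm_continue(prompt_tokens: List[int], text_content: List[int]) -> List[int]:
    """Stand-in for the LM: carry the prompt's prosody pattern over to new content."""
    if not prompt_tokens:
        raise ValueError("style prompt must contain at least one token")
    prompt_prosody = [t % PROSODY_LEVELS for t in prompt_tokens]
    return [
        c * PROSODY_LEVELS + prompt_prosody[i % len(prompt_prosody)]
        for i, c in enumerate(text_content)
    ]


def decode(tokens: List[int], timbre: int) -> Speech:
    """Stand-in for the decoder: unpack the tokens and inject the target timbre."""
    return Speech(
        content=[t // PROSODY_LEVELS for t in tokens],
        prosody=[t % PROSODY_LEVELS for t in tokens],
        timbre=timbre,
    )


def synthesize(text_content: List[int], style_prompt: Speech, timbre_prompt: Speech) -> Speech:
    """Prosody comes from style_prompt via the LM; timbre comes from timbre_prompt."""
    prompt_tokens = encode_content_prosody(style_prompt)
    new_tokens = toy_lm_continue(prompt_tokens, text_content)
    return decode(new_tokens, timbre_prompt.timbre)


style = Speech(content=[1, 2, 3], prosody=[0, 3, 1], timbre=7)
speaker = Speech(content=[5, 5], prosody=[2, 2], timbre=42)
out = synthesize([4, 0, 2, 6], style, speaker)
print(out)
```

In this sketch, swapping `speaker` for another prompt changes only `out.timbre`, and swapping `style` changes only `out.prosody`, which is the kind of independent control the abstract attributes to the disentangled codec.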
Related papers
- LSCodec: Low-bitrate And Speaker-decoupled Discrete Speech Codec (2024) 0.00
- FreeCodec: A Disentangled Neural Speech Codec With Fewer Tokens (2024) 4.52
- ControlSpeech: Towards Simultaneous And Independent Zero-shot Speaker Cloning And Zero-shot Language Style Control (2024) 9.40
- SoCodec: A Semantic-ordered Multi-stream Speech Codec For Efficient Language Model Based Text-to-speech Synthesis (2024) 6.34
- CoDiff-VC: A Codec-assisted Diffusion Model For Zero-shot Voice Conversion (2024) 0.00
- MSR-Codec: A Low-bitrate Multi-stream Residual Codec For High-fidelity Speech Generation With Information Disentanglement (2025) 2.35
- Spark-TTS: An Efficient LLM-based Text-to-speech Model With Single-stream Decoupled Speech Tokens (2025) 8.08
- LiveSpeech: Low-latency Zero-shot Text-to-speech Via Autoregressive Modeling Of Audio Discrete Codes (2024) 5.84