Msr-codec: A Low-bitrate Multi-stream Residual Codec For High-fidelity Speech Generation With Information Disentanglement
2025 Β· Jingyu Li, Guangyan Zhang, Zhen Ye, et al.
Abstract
Audio codecs are a critical component of modern speech generation systems. This paper introduces a low-bitrate, multi-scale residual codec that encodes speech into four distinct streams: semantic, timbre, prosody, and residual. This architecture achieves high-fidelity speech reconstruction at competitive low bitrates while demonstrating an inherent ability for information disentanglement. We construct a two-stage language model for text-to-speech (TTS) synthesis using this codec, which, despite its lightweight design and minimal data requirements, achieves a state-of-the-art Word Error Rate (WER) and superior speaker similarity compared to several larger models. Furthermore, the codec's design proves highly effective for voice conversion, enabling independent manipulation of speaker timbre and prosody. Our inference code, pre-trained models, and audio samples are available at https://github.com/herbertLJY/MSRCodec.
Authors
(none)
Tags
Stats
Code
Related papers
- Lscodec: Low-bitrate And Speaker-decoupled Discrete Speech Codec (2024)0.00
- Optimizing Neural Speech Codec For Low-bitrate Compression Via Multi-scale Encoding (2024)0.00
- Socodec: A Semantic-ordered Multi-stream Speech Codec For Efficient Language Model Based Text-to-speech Synthesis (2024)6.34
- Semanticodec: An Ultra Low Bitrate Semantic Audio Codec For General Sound (2024)10.97
- Freecodec: A Disentangled Neural Speech Codec With Fewer Tokens (2024)4.52
- Single-codec: Single-codebook Speech Codec Towards High-performance Speech Generation (2024)9.23
- Pscodec: A Series Of High-fidelity Low-bitrate Neural Speech Codecs Leveraging Prompt Encoders (2024)0.00
- Spectral Codecs: Improving Non-autoregressive Speech Synthesis With Spectrogram-based Audio Codecs (2024)0.00