Optimizing Neural Speech Codec For Low-bitrate Compression Via Multi-scale Encoding
2024 · Peiji Yang, Fengping Wang, Yicheng Zhong, et al.
Abstract
Neural speech codecs have demonstrated their ability to compress high-quality speech and audio by converting them into discrete token representations. Most existing methods utilize Residual Vector Quantization (RVQ) to encode speech into multiple layers of discrete codes with uniform time scales. However, this strategy overlooks the differences in information density across various speech features, leading to redundant encoding of sparse information, which limits the performance of these methods at low bitrate. This paper proposes MsCodec, a novel multi-scale neural speech codec that encodes speech into multiple layers of discrete codes, each corresponding to a different time scale. This encourages the model to decouple speech features according to their diverse information densities, consequently enhancing the performance of speech compression. Furthermore, we incorporate mutual information loss to augment the diversity among speech codes across different layers. Experimental results
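The contrast between uniform and multi-scale time scales can be illustrated with a toy sketch. This is not the MsCodec architecture: the pooling and upsampling scheme, the frame rates, the codebook size, the random codebooks, and the function names `bitrate_bps`, `nearest_code`, and `multiscale_rvq` are all assumptions chosen for illustration. It shows only how giving each quantizer layer its own time scale changes the number of codes emitted, and therefore the bitrate.

```python
import math
import numpy as np

def bitrate_bps(frame_rates_hz, codebook_size):
    """Bits per second when each layer emits one code per frame at its own rate."""
    bits_per_code = math.log2(codebook_size)
    return sum(rate * bits_per_code for rate in frame_rates_hz)

def nearest_code(frames, codebook):
    """Return the index of the closest codeword (squared L2) for each frame."""
    # frames: (T, D), codebook: (K, D) -> dists: (T, K)
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def multiscale_rvq(latents, codebooks, strides):
    """Toy residual quantization; layer i runs at 1/strides[i] of the frame rate."""
    residual = latents.copy()
    codes = []
    for codebook, stride in zip(codebooks, strides):
        num_frames, dim = residual.shape
        # Average-pool the residual over non-overlapping windows of `stride` frames.
        pooled = residual.reshape(num_frames // stride, stride, dim).mean(axis=1)
        idx = nearest_code(pooled, codebook)
        # Upsample the chosen codewords back to the full rate by repetition.
        quantized = np.repeat(codebook[idx], stride, axis=0)
        residual = residual - quantized
        codes.append(idx)
    # Codes per layer, plus the sum of all quantized layers.
    return codes, latents - residual

# Hypothetical setup: 16 frames of 4-dim latents, three coarse-to-fine layers.
rng = np.random.default_rng(0)
latents = rng.normal(size=(16, 4))
strides = [4, 2, 1]
codebooks = [rng.normal(size=(8, 4)) for _ in strides]

codes, recon = multiscale_rvq(latents, codebooks, strides)
code_lengths = [len(c) for c in codes]  # one entry per layer: 4, 8, 16 codes

# Hypothetical rates: three layers at 50 Hz versus 12.5/25/50 Hz, 1024-entry codebooks.
uniform_bps = bitrate_bps([50, 50, 50], 1024)
multiscale_bps = bitrate_bps([12.5, 25, 50], 1024)
```

Under these assumed settings the uniform layout costs 1500 bit/s and the multi-scale layout 875 bit/s, because the coarser layers emit fewer codes per second. Whether that saving preserves quality depends on how well the slowly varying features are captured by the coarse layers, which this sketch does not measure.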
Related papers
- CQNV: A Combination Of Coarsely Quantized Bitstream And Neural Vocoder For Low Rate Speech Coding (2023) · 6.34
- MDCTCodec: A Lightweight MDCT-based Neural Audio Codec Towards High Sampling Rate And Low Bitrate Scenarios (2024) · 8.09
- MSR-Codec: A Low-bitrate Multi-stream Residual Codec For High-fidelity Speech Generation With Information Disentanglement (2025) · 2.35
- FreeCodec: A Disentangled Neural Speech Codec With Fewer Tokens (2024) · 4.52
- PSCodec: A Series Of High-fidelity Low-bitrate Neural Speech Codecs Leveraging Prompt Encoders (2024) · 0.00
- ESC: Efficient Speech Coding With Cross-scale Residual Vector Quantized Transformers (2024) · 5.84
- Efficient And Scalable Neural Residual Waveform Coding With Collaborative Quantization (2020) · 8.60
- SpatialCodec: Neural Spatial Speech Coding (2023) · 3.69