CodecSlime: Temporal Redundancy Compression Of Neural Speech Codec Via Dynamic Frame Rate
2025 · Hankun Wang, Yiwei Guo, Chongtian Shao, et al.
Abstract
Neural speech codecs have been widely used in audio compression and various downstream tasks. Current mainstream codecs are fixed-frame-rate (FFR): they allocate the same number of tokens to every equal-duration slice. However, speech is inherently non-uniform in temporal information density, so many tokens are wasted on steady-state segments such as long vowels and silences. To address this mismatch, we present CodecSlime, a plugin-style method that compresses temporal redundancy by supporting dynamic frame rate (DFR) on neural speech codecs for the first time. Our method is unsupervised and architecture-agnostic, and combines two key innovations, ScheDFR and Melt-and-Cool, which adapt inference and training, respectively. When integrated into a typical VQ-GAN codec backbone and operated at 40 Hz DFR (\(\approx\) 600 bps), CodecSlime reduces reconstruction WER by up to 32% relative to conventional FFR baselines with the same model architecture and a similar bitrate.
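The relationship between frame rate and bitrate quoted in the abstract can be made concrete with a little arithmetic. The sketch below is illustrative only and is not taken from the paper: the helper names and the toy frame durations are hypothetical, and it assumes every bit is spent on frame tokens, ignoring any side information a DFR codec may need to signal frame durations.

```python
def bitrate_bps(avg_frame_rate_hz: float, bits_per_frame: float) -> float:
    """Bitrate of a token stream: frames per second times bits per frame."""
    return avg_frame_rate_hz * bits_per_frame


def bits_per_frame(bitrate: float, avg_frame_rate_hz: float) -> float:
    """Average number of bits each frame carries at a given bitrate."""
    return bitrate / avg_frame_rate_hz


def average_frame_rate(frame_durations_s: list[float]) -> float:
    """Average frame rate of a sequence of variable-length frames."""
    return len(frame_durations_s) / sum(frame_durations_s)


# Figures from the abstract: about 600 bps at a 40 Hz average frame rate.
implied_bits = bits_per_frame(600, 40)

# Hypothetical one-second utterance (durations invented for illustration):
# short frames on fast-changing speech, long frames on steady-state
# segments such as long vowels and silences.
durations = [0.0125] * 20 + [0.05] * 10 + [0.025] * 10
avg_rate = average_frame_rate(durations)

print(f"implied bits per frame: {implied_bits:.1f}")
print(f"average frame rate: {avg_rate:.1f} Hz")
print(f"bitrate: {bitrate_bps(avg_rate, implied_bits):.0f} bps")
```

Under these assumptions, a DFR stream is characterized by its average frame rate: individual frames may be shorter or longer than the FFR slice, as long as the total frame count per second matches the target.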
Related papers
- FlexiCodec: A Dynamic Neural Audio Codec For Low Frame Rates (2025) 3.38
- Optimizing Neural Speech Codec For Low-bitrate Compression Via Multi-scale Encoding (2024) 0.00
- FreeCodec: A Disentangled Neural Speech Codec With Fewer Tokens (2024) 4.52
- Low Frame-rate Speech Codec: A Codec Designed For Fast High-quality Speech LLM Training And Inference (2024) 5.24
- PSCodec: A Series Of High-fidelity Low-bitrate Neural Speech Codecs Leveraging Prompt Encoders (2024) 0.00
- Latent-domain Predictive Neural Speech Coding (2022) 12.15
- STFTCodec: High-fidelity Audio Compression Through Time-frequency Domain Representation (2025) 2.26
- SpatialCodec: Neural Spatial Speech Coding (2023) 3.69