Speaking Clearly: A Simplified Whisper-based Codec For Low-bitrate Speech Coding
2025 · Xin Zhang, Lin Li, Xiangni Lu, et al.
Abstract
Speech codecs serve as bridges between continuous speech signals and large language models, yet face an inherent conflict between acoustic fidelity and semantic preservation. To mitigate this conflict, prevailing methods augment acoustic codecs with complex semantic supervision. We explore the opposite direction: a semantic-first approach that starts from a semantically capable model and adapts it for high-fidelity acoustic reconstruction. Through empirical analysis, we discover that targeted architectural simplification can unlock the acoustic modeling potential of Whisper, a text-aligned Automatic Speech Recognition (ASR) model. Based on this finding, we propose SimWhisper-Codec, a novel codec that balances semantic and acoustic preservation by leveraging a frozen, simplified Whisper encoder without requiring external supervision. Experimental results demonstrate that SimWhisper-Codec achieves superior performance in both semantic preservation and acoustic quality compared to sem
Related papers
- Msr-codec: A Low-bitrate Multi-stream Residual Codec For High-fidelity Speech Generation With Information Disentanglement (2025)
- Simul-whisper: Attention-guided Streaming Whisper With Truncation Detection (2024)
- Whispervc: Decoupled Cross-domain Alignment And Speech Generation For Low-resource Whisper-to-normal Conversion (2025)
- Lscodec: Low-bitrate And Speaker-decoupled Discrete Speech Codec (2024)
- Semanticodec: An Ultra Low Bitrate Semantic Audio Codec For General Sound (2024)
- Adapting Whisper For Code-switching Through Encoding Refining And Language-aware Decoding (2024)
- Socodec: A Semantic-ordered Multi-stream Speech Codec For Efficient Language Model Based Text-to-speech Synthesis (2024)
- Spectral Codecs: Improving Non-autoregressive Speech Synthesis With Spectrogram-based Audio Codecs (2024)