Modeling Strategies For Speech Enhancement In The Latent Space Of A Neural Audio Codec
2025 · Sofiene Kammoun, Xavier Alameda-Pineda, Simon Leglaive
Abstract
Neural audio codecs (NACs) provide compact latent speech representations in the form of sequences of continuous vectors or discrete tokens. In this work, we investigate how these two types of speech representations compare when used as training targets for supervised speech enhancement. We consider both autoregressive and non-autoregressive speech enhancement models based on the Conformer architecture, as well as a simple baseline in which the NAC encoder is fine-tuned for speech enhancement. Our experiments reveal three key findings: predicting continuous latent representations consistently outperforms discrete token prediction; autoregressive models achieve higher quality but at the expense of intelligibility and efficiency, making non-autoregressive models more attractive in practice; and adding encoder fine-tuning yields the strongest enhancement metrics overall, though at the cost of degraded codec reconstruction. The code and audio samples are available online.
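To make the distinction between the two kinds of training target concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the single toy codebook, the nearest-neighbour quantizer, the mean squared error on continuous latents, and the cross-entropy on token indices are all illustrative assumptions, and every name below is hypothetical. Real NACs typically use several quantization stages, and the abstract does not state which losses are used.

```python
import math

def nearest_code(vector, codebook):
    """Index of the codebook entry closest to `vector` (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((v - c) ** 2 for v, c in zip(vector, codebook[k])),
    )

def mse_loss(predicted, target):
    """Mean squared error between two sequences of latent vectors."""
    total = sum(
        (p - t) ** 2
        for pred_vec, tgt_vec in zip(predicted, target)
        for p, t in zip(pred_vec, tgt_vec)
    )
    count = sum(len(tgt_vec) for tgt_vec in target)
    return total / count

def cross_entropy_loss(logits, token_ids):
    """Average negative log-likelihood of the target tokens under softmax(logits)."""
    nll = 0.0
    for frame_logits, token in zip(logits, token_ids):
        m = max(frame_logits)  # subtract the max for numerical stability
        log_norm = m + math.log(sum(math.exp(x - m) for x in frame_logits))
        nll += log_norm - frame_logits[token]
    return nll / len(token_ids)

# Toy data (assumed): 2-dimensional latents, a 3-entry codebook, 2 frames.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
clean_latents = [[0.9, 0.1], [0.1, 0.8]]  # stand-in for encoder output on clean speech

# Continuous target: the model regresses the clean latent vectors directly.
predicted_latents = [[1.0, 0.0], [0.0, 1.0]]
continuous_loss = mse_loss(predicted_latents, clean_latents)

# Discrete target: the model classifies the token index of each clean frame.
clean_tokens = [nearest_code(z, codebook) for z in clean_latents]
predicted_logits = [[0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
discrete_loss = cross_entropy_loss(predicted_logits, clean_tokens)
```

In this sketch the continuous target keeps the full latent vector of each frame, while the discrete target keeps only the index of its nearest codebook entry, so the two models are trained with a regression loss and a classification loss respectively.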
Related papers
- Neural Speech And Audio Coding: Modern AI Technology Meets Traditional Codecs (2024)
- Investigating Neural Audio Codecs For Speech Language Model-based Speech Generation (2024)
- Speaker Anonymization Using Neural Audio Codec Language Models (2023)
- LiveSpeech: Low-latency Zero-shot Text-to-speech Via Autoregressive Modeling Of Audio Discrete Codes (2024)
- Enhancing Into The Codec: Noise Robust Speech Coding With Vector-quantized Autoencoders (2021)
- NoLACE: Improving Low-complexity Speech Codec Enhancement Through Adaptive Temporal Shaping (2023)
- A Neural Speech Codec For Noise Robust Speech Coding (2023)
- Latent-domain Predictive Neural Speech Coding (2022)