Beyond Noise Suppression: Dynamic Distortion Control Loss for Speech Enhancement and Robust Automatic Speech Recognition

Abstract

Under severe noise, using a speech enhancement (SE) model as a front-end is a computationally efficient strategy for robust automatic speech recognition (ASR) and a practical alternative to the costly fine-tuning of large-scale ASR systems. However, improvements in human perceptual quality do not guarantee better machine recognition accuracy: aggressive noise suppression often introduces distortions and artifacts that obscure fine-grained spectral details and increase the word error rate (WER). To mitigate this discrepancy, we present the Dynamic Distortion Control (DDC) loss, a unified training objective designed to bridge the gap between perceptual fidelity and recognition robustness. We integrate the loss into a time-frequency Transformer architecture. Experimental results on LibriSpeech corrupted by noise from the DNS Challenge demonstrate the effectiveness of the DDC loss for both perceptual quality and recognition accuracy across diverse noise conditions.
