Aligning Generative Speech Enhancement With Perceptual Feedback
2025 · Haoyang Li, Nana Hou, Yuchen Hu, et al.
Abstract
Language Model (LM)-based speech enhancement (SE) has recently emerged as a promising direction, but existing approaches predominantly rely on token-level likelihood objectives that weakly reflect human perception. This mismatch limits progress, as optimizing signal accuracy does not always improve naturalness or listening comfort. We address this gap by introducing a perceptually aligned LM-based SE approach. Our method applies Direct Preference Optimization (DPO) with UTMOS, a neural MOS predictor, as a proxy for human ratings, directly steering models toward perceptually preferred outputs. This design directly connects model training to perceptual quality and is broadly applicable within LM-based SE frameworks. On the Deep Noise Suppression Challenge 2020 test sets, our approach consistently improves speech quality metrics, achieving relative gains of up to 56%. To our knowledge, this is the first integration of perceptual feedback into LM-based SE and the first application of DPO i
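As a rough illustration of the idea in the abstract, the sketch below shows how quality scores from a MOS predictor could be turned into a preference pair, and how the standard DPO loss is computed for that pair. It is not the authors' implementation. The abstract does not say how pairs are constructed, so picking the best- and worst-scored candidates is an assumption, as are the function names `build_preference_pair` and `dpo_loss`, the value `beta=0.1`, and the example scores, which merely stand in for UTMOS outputs (no real UTMOS API is called).

```python
import math


def build_preference_pair(candidates, scores):
    """Return (chosen, rejected) from candidate outputs and their quality scores.

    Assumption: the highest-scored candidate is "chosen" and the lowest-scored
    one is "rejected". The scores are hypothetical stand-ins for UTMOS values.
    """
    if len(candidates) != len(scores) or len(candidates) < 2:
        raise ValueError("need at least two candidates with one score each")
    # Sort candidate indices by score; ties are broken by original position.
    order = sorted(range(len(candidates)), key=lambda i: (scores[i], i))
    return candidates[order[-1]], candidates[order[0]]


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin)).

    Each argument is the summed token log-probability of a sequence under
    the trainable policy or the frozen reference model. beta is an assumed
    hyperparameter value, not one taken from the paper.
    """
    margin = beta * ((policy_logp_chosen - ref_logp_chosen)
                     - (policy_logp_rejected - ref_logp_rejected))
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical example: three candidate token sequences for one noisy input.
chosen, rejected = build_preference_pair(
    ["tokens_a", "tokens_b", "tokens_c"], [3.1, 4.2, 2.0]
)
loss_at_start = dpo_loss(-50.0, -50.0, -50.0, -50.0)    # policy == reference
loss_after_shift = dpo_loss(-45.0, -55.0, -50.0, -50.0)  # policy favors chosen
```

When the policy equals the reference model the margin is zero and the loss is log 2. The loss falls as the policy assigns relatively more probability to the preferred output than the reference does, which is how the perceptual score ends up steering training.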
Related papers
- Attention-based Speech Enhancement Using Human Quality Perception Modelling (2023)
- Multi-metric Preference Alignment For Generative Speech Restoration (2025)
- Sense: Semantic-aware High-fidelity Universal Speech Enhancement (2025)
- Reinforcement Learning Based Speech Enhancement For Robust Speech Recognition (2018)
- Using RLHF To Align Speech Enhancement Approaches To Mean-opinion Quality Scores (2024)
- SELM: Speech Enhancement Using Discrete Tokens And Language Models (2023)
- Multi-metric Optimization Using Generative Adversarial Networks For Near-end Speech Intelligibility Enhancement (2021)
- Multi-CMGAN+/+: Leveraging Multi-objective Speech Quality Metric Prediction For Speech Enhancement (2023)