Perceptual Loss Based Speech Denoising With An Ensemble Of Audio Pattern Recognition And Self-supervised Models
2020 · Saurabh Kataria, Jesús Villalba, Najim Dehak
Abstract
Deep-learning-based speech denoising still struggles to improve the perceptual quality of enhanced signals. We introduce a generalized framework called Perceptual Ensemble Regularization Loss (PERL), built on the idea of perceptual losses. A perceptual loss discourages distortion of certain speech properties, and we analyze it using six large-scale pre-trained models: speaker classification, acoustic model, speaker embedding, emotion classification, and two self-supervised speech encoders (PASE+, wav2vec 2.0). We first build a strong baseline (without PERL) using Conformer Transformer networks on the popular enhancement benchmark VCTK-DEMAND. Using the auxiliary models one at a time, we find the acoustic event model and the self-supervised model PASE+ to be the most effective. Our best model (PERL-AE) uses only the acoustic event model (utilizing AudioSet) to outperform state-of-the-art methods on major perceptual metrics. To explore whether denoising can leverage the full framework, we use all networks bu
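The idea of regularizing a denoiser with an ensemble of auxiliary models can be illustrated with a minimal sketch. It is not the authors' implementation: the L1 distance, the weighted-sum form, and every name below (`perceptual_ensemble_loss`, `frame_energy`, `first_difference`, `base_weight`) are assumptions made for illustration, and the two toy feature extractors merely stand in for frozen pre-trained networks such as an acoustic event model or PASE+.

```python
from typing import Callable, Dict, List, Sequence

Signal = Sequence[float]
FeatureFn = Callable[[Signal], List[float]]


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute difference between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def perceptual_ensemble_loss(
    enhanced: Signal,
    clean: Signal,
    aux_models: Dict[str, FeatureFn],
    weights: Dict[str, float],
    base_weight: float = 1.0,
) -> float:
    """Signal-level loss plus weighted feature-matching terms (assumed form)."""
    # Reconstruction term computed directly on the signals.
    total = base_weight * l1_distance(enhanced, clean)
    for name, extract in aux_models.items():
        # Each auxiliary model is treated as a fixed feature extractor; its
        # term penalizes differences between enhanced and clean features.
        # Models without an entry in `weights` contribute nothing.
        weight = weights.get(name, 0.0)
        total += weight * l1_distance(extract(enhanced), extract(clean))
    return total


# Hypothetical stand-ins for pre-trained auxiliary networks.
def frame_energy(signal: Signal, frame: int = 2) -> List[float]:
    """Energy of consecutive non-overlapping frames."""
    return [
        sum(s * s for s in signal[i:i + frame])
        for i in range(0, len(signal), frame)
    ]


def first_difference(signal: Signal) -> List[float]:
    """Sample-to-sample differences."""
    return [b - a for a, b in zip(signal, signal[1:])]


if __name__ == "__main__":
    clean = [0.0, 1.0, 0.0, -1.0]
    enhanced = [0.0, 0.5, 0.0, -1.0]
    models = {"energy": frame_energy, "diff": first_difference}
    # A single non-zero weight mimics using auxiliary models one at a time.
    print(perceptual_ensemble_loss(enhanced, clean, models, {"energy": 1.0}))
```

Setting one weight to a non-zero value corresponds to the one-model-at-a-time experiments; giving every auxiliary model a non-zero weight corresponds to using all networks together, at which point the relative weights have to be chosen.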
Related papers
- PL-EESR: Perceptual Loss Based End-to-End Robust Speaker Representation Extraction (2021) · 6.77
- A Comparative Evaluation Of Deep Learning Models For Speech Enhancement In Real-world Noisy Environments (2025) · 0.00
- Speech Denoising With Deep Feature Losses (2018) · 14.23
- Feature Learning And Ensemble Pre-tasks Based Self-supervised Speech Denoising And Dereverberation (2022) · 0.00
- Perceive And Predict: Self-supervised Speech Representation Based Loss Functions For Speech Enhancement (2023) · 7.16
- Feature Enhancement With Deep Feature Losses For Speaker Verification (2019) · 10.61
- MP-SENet: A Speech Enhancement Model With Parallel Denoising Of Magnitude And Phase Spectra (2023) · 15.51
- Unsupervised Speech Enhancement With Speech Recognition Embedding And Disentanglement Losses (2021) · 8.35