UNet-based Fusion And Exponential Moving Average Adaptation For Noise-robust Speaker Recognition
2026 · Chong-Xin Gan, Peter Bell, Man-Wai Mak, et al.
Abstract
arXiv:2604.25624v1. Joint training of speech enhancement and speaker embedding networks is widely adopted for speaker recognition in noisy acoustic environments. While effective, this paradigm often fails to leverage the generalization and robustness benefits inherent in large-scale speech enhancement pre-training. Moreover, maintaining the speaker information in the denoised speech is not an explicit objective of the speech enhancement process. To address these limitations, we propose a scalable UNet-based Fusion framework (UF-EMA) that treats the noisy and enhanced speech as a multi-channel input, thereby enabling the speaker encoder to exploit speaker information effectively. In addition, an Exponential Moving Average strategy is applied to a speaker encoder pre-trained on clean speech to mitigate overfitting and facilitate a smooth transition from clean to noisy conditions. Experimental r
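The abstract does not spell out the update rule behind the exponential moving average strategy. As a rough illustration only, the sketch below uses the textbook form, ema = decay * ema + (1 - decay) * new, applied to toy parameter lists. The function name `ema_update`, the dictionary layout, and the decay value are all assumptions made for this example, not details taken from the paper.

```python
def ema_update(ema_params, new_params, decay=0.99):
    """Blend two parameter dictionaries with an exponential moving average.

    Hypothetical helper: `ema_params` stands in for weights carried over from
    a clean-speech encoder, `new_params` for weights after a noisy-data update.
    """
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must lie in [0, 1]")
    return {
        name: [
            decay * old + (1.0 - decay) * new
            for old, new in zip(ema_params[name], new_params[name])
        ]
        for name in ema_params
    }


# Toy values chosen for illustration; they are not from the paper.
clean_encoder = {"layer.weight": [1.0, 2.0]}
noisy_step = {"layer.weight": [0.0, 4.0]}
smoothed = ema_update(clean_encoder, noisy_step, decay=0.9)
```

A decay close to 1 keeps the averaged weights near the clean-speech starting point, which is one plausible reading of the "smooth transition from clean to noisy conditions" that the abstract mentions.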
Related papers
- Obovox Far Field Speaker Recognition: A Novel Data Augmentation Approach With Pretrained Models (2024) · 0.00
- An Enhanced Res2Net With Local And Global Feature Fusion For Speaker Verification (2023) · 19.74
- Noise-conditioned Mixture-of-experts Framework For Robust Speaker Verification (2025) · 0.00
- ECAPA2: A Hybrid Neural Network Architecture And Training Strategy For Robust Speaker Embeddings (2024) · 0.00
- ERes2NetV2: Boosting Short-duration Speaker Verification Performance With Computational Efficiency (2024) · 9.41
- Fusion Of Embeddings Networks For Robust Combination Of Text Dependent And Independent Speaker Recognition (2021) · 4.52
- On The Use Of DNN Autoencoder For Robust Speaker Recognition (2018) · 0.00
- Attention-based Audio-visual Fusion For Robust Automatic Speech Recognition (2018) · 16.67