TRNet: Two-level Refinement Network Leveraging Speech Enhancement For Noise Robust Speech Emotion Recognition
2024 · Chengxin Chen, Pengyuan Zhang
Abstract
One persistent challenge in Speech Emotion Recognition (SER) is ubiquitous environmental noise, which often degrades SER performance in practice. In this paper, we introduce a Two-level Refinement Network, dubbed TRNet, to address this challenge. Specifically, a pre-trained speech enhancement module is employed for front-end noise reduction and noise level estimation. We then use clean speech spectrograms and their corresponding deep representations as reference signals to refine the spectrogram distortion and representation shift of enhanced speech during model training. Experimental results show that TRNet substantially improves the system's robustness in both matched and unmatched noisy environments, without compromising its performance in noise-free environments.
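The two refinement levels described in the abstract can be pictured as two auxiliary penalties added to the emotion classification loss: one on the spectrogram and one on the deep representation, each measured against its clean-speech reference. The following is only a minimal sketch of that idea. The abstract does not give the loss functions, so the L1 and mean-squared distances, the weights `w_spec` and `w_repr`, the array shapes, and all function names are assumptions made for illustration, and the noise level estimate is left out entirely.

```python
import numpy as np


def cross_entropy(logits, label):
    """Cross-entropy of one utterance's emotion logits against an integer label."""
    shifted = logits - logits.max()  # subtract the max for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(-log_probs[label])


def two_level_loss(enh_spec, clean_spec, enh_repr, clean_repr,
                   logits, label, w_spec=1.0, w_repr=1.0):
    """Combine an emotion loss with two clean-reference penalties.

    The L1 and MSE distances and the weights w_spec / w_repr are
    illustrative assumptions, not values taken from the paper.
    """
    # Low level: distortion of the enhanced spectrogram relative to clean speech.
    spec_term = float(np.abs(enh_spec - clean_spec).mean())
    # High level: shift of the enhanced representation relative to clean speech.
    repr_term = float(((enh_repr - clean_repr) ** 2).mean())
    return cross_entropy(logits, label) + w_spec * spec_term + w_repr * repr_term


rng = np.random.default_rng(0)
clean_spec = rng.random((80, 100))            # hypothetical 80-bin, 100-frame spectrogram
enh_spec = clean_spec + 0.1 * rng.standard_normal((80, 100))
clean_repr = rng.standard_normal(256)         # hypothetical 256-dim utterance embedding
enh_repr = clean_repr + 0.1 * rng.standard_normal(256)
logits = np.array([2.0, 0.5, -1.0, 0.1])      # four hypothetical emotion classes

loss = two_level_loss(enh_spec, clean_spec, enh_repr, clean_repr, logits, label=0)
```

In this sketch both penalties vanish when the enhanced input matches its clean reference, so the total reduces to the plain classification loss, which is one way to read the claim that performance in noise-free conditions is not compromised.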
Related papers
- Noise Robust Speech Emotion Recognition With Signal-to-noise Ratio Adapting Speech Enhancement (2023) 0.00
- Two-stage Framework For Robust Speech Emotion Recognition Using Target Speaker Extraction In Human Speech Noise Conditions (2024) 3.58
- Speech And Noise Dual-stream Spectrogram Refine Network With Speech Distortion Loss For Robust Speech Recognition (2023) 5.24
- Describe Where You Are: Improving Noise-robustness For Speech Emotion Recognition With Text Description Of The Environment (2024) 4.52
- TrustSER: On The Trustworthiness Of Fine-tuning Pre-trained Speech Embeddings For Speech Emotion Recognition (2023) 9.07
- Bridging The Gap: Integrating Pre-trained Speech Enhancement And Recognition Models For Robust Speech Recognition (2024) 7.50
- Improved Speech Emotion Recognition Using Transfer Learning And Spectrogram Augmentation (2021) 12.74
- Active Learning Based Fine-tuning Framework For Speech Emotion Recognition (2023) 6.34