Describe Where You Are: Improving Noise-robustness For Speech Emotion Recognition With Text Description Of The Environment
2024 · Seong-Gyun Leem, Daniel Fulford, Jukka-Pekka Onnela, et al.
Abstract
Speech emotion recognition (SER) systems often struggle in real-world environments, where ambient noise severely degrades their performance. This paper explores a novel approach that exploits prior knowledge of testing environments to maximize SER performance under noisy conditions. To address this task, we propose a text-guided, environment-aware training strategy in which an SER model is trained with contaminated speech samples and their paired noise descriptions. We use a pre-trained text encoder to extract a text-based environment embedding and then fuse it with a transformer-based SER model during training and inference. We demonstrate the effectiveness of our approach through experiments with the MSP-Podcast corpus and real-world additive noise samples collected from the Freesound and DEMAND repositories. Our experiments indicate that text-based environment descriptions processed by a large language model (LLM) produce representations that improve the noise-robustness of the SER system.
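The abstract does not say how the contaminated samples are built, so the following is only a minimal sketch of one common way to do it: scale an additive noise clip to a chosen signal-to-noise ratio (SNR), add it to the clean speech, and keep the environment description alongside the result. The function names (`mix_at_snr`, `make_training_pair`), the dictionary keys, and the SNR value are illustrative assumptions, not details taken from the paper.

```python
import math


def rms(signal):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so the speech-to-noise ratio equals snr_db.

    Returns (noisy_speech, scaled_noise). Names and behavior are
    hypothetical; the paper's own mixing procedure is not given here.
    """
    noise = list(noise)
    if len(noise) < len(speech):
        # Loop the noise clip until it covers the whole utterance.
        repeats = len(speech) // len(noise) + 1
        noise = noise * repeats
    noise = noise[: len(speech)]

    noise_rms = rms(noise)
    if noise_rms == 0.0:
        raise ValueError("noise clip is silent; cannot scale to an SNR")

    # SNR_dB = 20 * log10(rms_speech / rms_noise), solved for rms_noise.
    target_noise_rms = rms(speech) / (10 ** (snr_db / 20))
    gain = target_noise_rms / noise_rms
    scaled = [gain * n for n in noise]
    noisy = [s + n for s, n in zip(speech, scaled)]
    return noisy, scaled


def make_training_pair(speech, noise, snr_db, description):
    """Pair a contaminated utterance with its environment description."""
    noisy, _ = mix_at_snr(speech, noise, snr_db)
    # The text would later go to a pre-trained text encoder (assumed step).
    return {"audio": noisy, "environment_text": description}
```

A call such as `make_training_pair(speech, noise, 5.0, "a busy cafe")` would yield one contaminated utterance with its text description; how the resulting text embedding is fused with the SER model is left open here because the abstract does not specify it.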
Related papers
- Two-stage Framework For Robust Speech Emotion Recognition Using Target Speaker Extraction In Human Speech Noise Conditions (2024)
- Noise Robust Speech Emotion Recognition With Signal-to-noise Ratio Adapting Speech Enhancement (2023)
- TRNet: Two-level Refinement Network Leveraging Speech Enhancement For Noise Robust Speech Emotion Recognition (2024)
- Environment Aware Text-to-speech Synthesis (2021)
- On The Efficacy And Noise-robustness Of Jointly Learned Speech Emotion And Automatic Speech Recognition (2023)
- Leveraging Speech PTM, Text LLM, And Emotional TTS For Speech Emotion Recognition (2023)
- Towards Interpretable And Transferable Speech Emotion Recognition: Latent Representation Based Analysis Of Features, Methods And Corpora (2021)
- Active Learning Based Fine-tuning Framework For Speech Emotion Recognition (2023)