Two-stage Framework For Robust Speech Emotion Recognition Using Target Speaker Extraction In Human Speech Noise Conditions
2024 · Jinyi Mi, Xiaohan Shi, Ding Ma, et al.
Abstract
Developing a speech emotion recognition (SER) system that is robust in noisy conditions is challenging because different types of noise have different properties. Most previous studies have not considered the impact of human speech noise, which limits the application scope of SER. In this paper, we propose a novel two-stage framework for this problem that cascades a target speaker extraction (TSE) method with SER. In the first stage, we train a TSE model to extract the speech of the target speaker from a mixture. In the second stage, we use the extracted speech for SER training. Additionally, we explore joint training of the TSE and SER models in the second stage. Our system achieves a 14.33% improvement in unweighted accuracy (UA) compared to a baseline that does not use the TSE method, demonstrating the effectiveness of our framework in mitigating the impact of human speech noise. Moreover, we conduct experiments that consider speaker gender, showing that our framework performs particularly well on different-gender mixtures.
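The abstract reports its result in unweighted accuracy (UA). As a minimal sketch of the metric under its standard definition (the mean of per-class recalls, so that every emotion class counts equally regardless of how many utterances it has), the Python below computes UA alongside plain overall accuracy for comparison. The function names, the emotion labels, and the toy predictions are hypothetical illustrations and are not taken from the paper.

```python
from collections import defaultdict


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls: each class contributes equally."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    total = defaultdict(int)    # utterances per reference class
    correct = defaultdict(int)  # correctly recognized utterances per class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


def overall_accuracy(y_true, y_pred):
    """Fraction of all utterances recognized correctly (often called WA)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical, imbalanced toy data: 6 "neutral" and 2 "angry" utterances.
y_true = ["neutral"] * 6 + ["angry"] * 2
y_pred = ["neutral"] * 6 + ["neutral", "angry"]

ua = unweighted_accuracy(y_true, y_pred)  # (6/6 + 1/2) / 2 = 0.75
wa = overall_accuracy(y_true, y_pred)     # 7/8 = 0.875
print(f"UA={ua:.3f}  WA={wa:.3f}")
```

In this toy case the minority class is recognized only half the time, which pulls UA down to 0.75 while overall accuracy stays at 0.875. This sensitivity to small classes is why UA is a common choice for class-imbalanced emotion corpora.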
Related papers
- Noise Robust Speech Emotion Recognition With Signal-to-noise Ratio Adapting Speech Enhancement (2023)
- TRNet: Two-level Refinement Network Leveraging Speech Enhancement For Noise Robust Speech Emotion Recognition (2024)
- Describe Where You Are: Improving Noise-robustness For Speech Emotion Recognition With Text Description Of The Environment (2024)
- On The Efficacy And Noise-robustness Of Jointly Learned Speech Emotion And Automatic Speech Recognition (2023)
- Active Learning Based Fine-tuning Framework For Speech Emotion Recognition (2023)
- Improved Speech Emotion Recognition Using Transfer Learning And Spectrogram Augmentation (2021)
- Lightweight Speech Enhancement Guided Target Speech Extraction In Noisy Multi-speaker Scenarios (2025)
- TrustSER: On The Trustworthiness Of Fine-tuning Pre-trained Speech Embeddings For Speech Emotion Recognition (2023)