Enhancing Speech Emotion Recognition Through Segmental Average Pooling Of Self-supervised Learning Features
2024 · Jonghwan Hyeon, Yung-Hwan Oh, Ho-Jin Choi
Abstract
Speech Emotion Recognition (SER) analyzes human emotions expressed through speech. Self-supervised learning (SSL) offers a promising approach to SER by learning meaningful representations from large amounts of unlabeled audio data. However, existing SSL-based methods rely on Global Average Pooling (GAP) to represent audio signals, treating speech and non-speech segments equally, which can dilute informative speech features with irrelevant non-speech information. To address this, the paper proposes Segmental Average Pooling (SAP), which selectively focuses on informative speech segments while ignoring non-speech segments. Applying both GAP and SAP to SSL features lets the approach use overall signal information from GAP and specific information about the speech segments from SAP, improving SER performance. Experiments show state-of-the-art results on the English IEMOCAP dataset and superior performance on the Korean KEMDy19 dataset, in both unweighted and weighted accuracy.
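The difference between the two pooling schemes can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the abstract does not say how speech segments are detected or how the two pooled vectors are combined. The sketch therefore assumes a per-frame boolean speech mask is already available (for example from a voice activity detector), computes SAP as the mean of per-segment means, and joins GAP and SAP by concatenation. All function names are hypothetical.

```python
import numpy as np


def global_average_pooling(features):
    """Mean over all frames; features has shape (num_frames, dim)."""
    return features.mean(axis=0)


def speech_segments(speech_mask):
    """Return (start, end) pairs, end exclusive, for each run of speech frames."""
    mask = np.asarray(speech_mask, dtype=bool)
    padded = np.concatenate(([False], mask, [False]))
    # Indices where the mask flips; even entries are starts, odd entries are ends.
    changes = np.flatnonzero(padded[1:] != padded[:-1])
    return list(zip(changes[0::2], changes[1::2]))


def segmental_average_pooling(features, speech_mask):
    """Assumed SAP: average within each speech segment, then across segments."""
    segments = speech_segments(speech_mask)
    if not segments:
        # No speech detected: fall back to a zero vector (an assumption).
        return np.zeros(features.shape[1])
    segment_means = [features[start:end].mean(axis=0) for start, end in segments]
    return np.mean(segment_means, axis=0)


def pooled_representation(features, speech_mask):
    """Assumed combination: concatenate the GAP and SAP vectors."""
    gap = global_average_pooling(features)
    sap = segmental_average_pooling(features, speech_mask)
    return np.concatenate([gap, sap])


# Toy input: 5 frames of 2-dimensional features; frames 0 and 3 are non-speech.
toy_features = np.array([[0.0, 0.0], [2.0, 4.0], [4.0, 8.0], [0.0, 0.0], [6.0, 6.0]])
toy_mask = [False, True, True, False, True]
toy_vector = pooled_representation(toy_features, toy_mask)
```

In the toy input the non-speech frames pull the GAP half of the vector toward zero, while the SAP half depends only on the two speech segments. This is the dilution effect the abstract describes.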
Related papers
- End-to-end Integration Of Speech Emotion Recognition With Voice Activity Detection Using Self-supervised Learning Features (2024)
- Exploring Self-supervised Multi-view Contrastive Learning For Speech Emotion Recognition With Limited Annotations (2024)
- Unsupervised Representations Improve Supervised Learning In Speech Emotion Recognition (2023)
- SER Evals: In-domain And Out-of-domain Benchmarking For Speech Emotion Recognition (2024)
- Emotion-aware Speech Self-supervised Representation Learning With Intensity Knowledge (2024)
- Investigating Self-supervised Learning For Speech Enhancement And Separation (2022)
- Cross-lingual Speech Emotion Recognition: Humans Vs. Self-supervised Models (2024)
- Leveraging Semantic Information For Efficient Self-supervised Emotion Recognition With Audio-textual Distilled Models (2023)