SER Evals: In-domain And Out-of-domain Benchmarking For Speech Emotion Recognition
2024 · Mohamed Osman, Daniel Z. Kaplan, Tamer Nadeem
Abstract
Speech emotion recognition (SER) has made significant strides with the advent of powerful self-supervised learning (SSL) models. However, the generalization of these models to diverse languages and emotional expressions remains a challenge. We propose a large-scale benchmark to evaluate the robustness and adaptability of state-of-the-art SER models in both in-domain and out-of-domain settings. Our benchmark includes a diverse set of multilingual datasets, focusing on less commonly used corpora to assess generalization to new data. We employ logit adjustment to account for varying class distributions and establish a single dataset cluster for systematic evaluation. Surprisingly, we find that the Whisper model, primarily designed for automatic speech recognition, outperforms dedicated SSL models in cross-lingual SER. Our results highlight the need for more robust and generalizable SER models, and our benchmark serves as a valuable resource to drive future research in this direction.
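The abstract mentions logit adjustment for varying class distributions without giving details. As a rough illustration only, the sketch below shows one common post-hoc form of the idea: subtracting a scaled log class prior from each logit before taking the argmax, so that rare emotion classes are not systematically under-predicted. The function names, the `tau` parameter, the toy label counts, and the logits are all assumptions made for this example; the paper may use a different variant (for instance, applying the adjustment inside the training loss).

```python
import numpy as np


def class_priors(labels, num_classes):
    """Estimate class priors from integer training labels (hypothetical helper)."""
    counts = np.bincount(labels, minlength=num_classes).astype(float)
    return counts / counts.sum()


def adjust_logits(logits, priors, tau=1.0, eps=1e-12):
    """Post-hoc logit adjustment: subtract tau * log(prior) from each class logit.

    tau=0 leaves the logits unchanged; larger tau favours rare classes more.
    eps guards against log(0) for classes absent from the training labels.
    """
    return logits - tau * np.log(priors + eps)


# Toy, made-up imbalanced label set: 80 / 15 / 5 examples for three classes.
train_labels = np.array([0] * 80 + [1] * 15 + [2] * 5)
priors = class_priors(train_labels, num_classes=3)

# Toy, made-up logits for two utterances.
logits = np.array([
    [3.0, 1.5, 0.0],
    [0.2, 0.1, 0.0],
])

raw_pred = logits.argmax(axis=1)
adj_pred = adjust_logits(logits, priors, tau=1.0).argmax(axis=1)

print("priors:", priors)
print("raw predictions:", raw_pred)
print("adjusted predictions:", adj_pred)
```

With these toy numbers, the unadjusted argmax picks the majority class for both utterances, while the adjusted scores shift both predictions toward the less frequent classes. In practice `tau` would be tuned on held-out data.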
Related papers
- What Does It Take To Generalize SER Model Across Datasets? A Comprehensive Benchmark (2024)
- Cross-lingual Speech Emotion Recognition: Humans Vs. Self-supervised Models (2024)
- EmoBox: Multilingual Multi-corpus Speech Emotion Recognition Toolkit And Benchmark (2024)
- Semi-supervised Cross-lingual Speech Emotion Recognition (2022)
- Exploring Self-supervised Multi-view Contrastive Learning For Speech Emotion Recognition With Limited Annotations (2024)
- SpeechEQ: Speech Emotion Recognition Based On Multi-scale Unified Datasets And Multitask Learning (2022)
- Decoding Emotions: A Comprehensive Multilingual Study Of Speech Models For Speech Emotion Recognition (2023)
- TrustSER: On The Trustworthiness Of Fine-tuning Pre-trained Speech Embeddings For Speech Emotion Recognition (2023)