Cross-lingual Speech Emotion Recognition: Humans Vs. Self-supervised Models
2024 · Zhichen Han, Tianqi Geng, Hui Feng, et al.
Abstract
Utilizing Self-Supervised Learning (SSL) models for Speech Emotion Recognition (SER) has proven effective, yet limited research has explored cross-lingual scenarios. This study presents a comparative analysis between human performance and SSL models, beginning with a layer-wise analysis and an exploration of parameter-efficient fine-tuning strategies in monolingual, cross-lingual, and transfer learning contexts. We further compare the SER ability of models and humans at both utterance- and segment-levels. Additionally, we investigate the impact of dialect on cross-lingual SER through human evaluation. Our findings reveal that models, with appropriate knowledge transfer, can adapt to the target language and achieve performance comparable to native speakers. We also demonstrate the significant effect of dialect on SER for individuals without prior linguistic and paralinguistic background. Moreover, both humans and models exhibit distinct behaviors across different emotions. These results
Related papers
- Semi-supervised Cross-lingual Speech Emotion Recognition (2022)
- SER Evals: In-domain And Out-of-domain Benchmarking For Speech Emotion Recognition (2024)
- Decoding Emotions: A Comprehensive Multilingual Study Of Speech Models For Speech Emotion Recognition (2023)
- Exploring Self-supervised Multi-view Contrastive Learning For Speech Emotion Recognition With Limited Annotations (2024)
- A Layer-anchoring Strategy For Enhancing Cross-lingual Speech Emotion Recognition (2024)
- Exploring Acoustic Similarity In Emotional Speech And Music Via Self-supervised Representations (2024)
- Leveraging Cross-attention Transformer And Multi-feature Fusion For Cross-linguistic Speech Emotion Recognition (2025)
- Multilingual Speech Emotion Recognition With Multi-gating Mechanism And Neural Architecture Search (2022)