Decoding Emotions: A Comprehensive Multilingual Study Of Speech Models For Speech Emotion Recognition
2023 · Anant Singh, Akshat Gupta
Abstract
Recent advances in transformer-based speech representation models have transformed speech processing. However, little research has evaluated these models for speech emotion recognition (SER) across multiple languages or examined their internal representations. This article addresses these gaps by presenting a comprehensive benchmark for SER with eight speech representation models and six different languages. We conduct probing experiments to gain insight into the inner workings of these models for SER. We find that using features from a single optimal layer of a speech model reduces the error rate by 32% on average across seven datasets, compared to systems that use features from all layers of the speech model. We also achieve state-of-the-art results for German and Persian. Our probing results indicate that the middle layers of speech models capture the most important emotional information for SER.
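The comparison between a single best layer and an all-layers system can be illustrated with a small sketch. It is not the authors' code. It assumes that the "optimal" layer is the one whose probe scores highest on validation data, and that the 32% figure is a relative reduction in error rate (1 - accuracy); the abstract does not state either detail. The function names and all numbers below are hypothetical.

```python
from typing import Sequence


def best_layer(val_accuracy_per_layer: Sequence[float]) -> int:
    """Return the index of the layer whose probe scored highest on validation data."""
    return max(range(len(val_accuracy_per_layer)),
               key=lambda i: val_accuracy_per_layer[i])


def relative_error_reduction(baseline_accuracy: float, new_accuracy: float) -> float:
    """Fraction by which the error rate (1 - accuracy) shrinks relative to the baseline."""
    baseline_error = 1.0 - baseline_accuracy
    new_error = 1.0 - new_accuracy
    if baseline_error <= 0:
        raise ValueError("baseline has no error to reduce")
    return (baseline_error - new_error) / baseline_error


# Made-up validation accuracies for a hypothetical 6-layer model (not from the paper).
layer_accuracies = [0.52, 0.61, 0.70, 0.74, 0.68, 0.60]
# Made-up accuracy for a system that combines features from every layer.
all_layers_accuracy = 0.66

layer = best_layer(layer_accuracies)
reduction = relative_error_reduction(all_layers_accuracy, layer_accuracies[layer])
print(f"best layer: {layer}, relative error reduction: {reduction:.1%}")
```

In this toy example the highest-scoring layer sits in the middle of the stack, which mirrors the abstract's probing claim only because the numbers were chosen that way.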
Related papers
- Towards Interpretable And Transferable Speech Emotion Recognition: Latent Representation Based Analysis Of Features, Methods And Corpora (2021)
- Multilingual Speech Emotion Recognition With Multi-gating Mechanism And Neural Architecture Search (2022)
- Cross-lingual Speech Emotion Recognition: Humans Vs. Self-supervised Models (2024)
- Are Paralinguistic Representations All That Is Needed For Speech Emotion Recognition? (2024)
- Probing Speech Emotion Recognition Transformers For Linguistic Knowledge (2022)
- An Analysis Of Large Speech Models-based Representations For Speech Emotion Recognition (2023)
- A Layer-anchoring Strategy For Enhancing Cross-lingual Speech Emotion Recognition (2024)
- Leveraging Cross-attention Transformer And Multi-feature Fusion For Cross-linguistic Speech Emotion Recognition (2025)