Exploring Attention Mechanisms For Multimodal Emotion Recognition In An Emergency Call Center Corpus
2023 · Théo Deschamps-Berger, Lori Lamel, Laurence Devillers
Abstract
Emotion detection technology that enhances human decision-making is an important research issue for real-world applications, but real-life emotion datasets are relatively rare and small. The experiments conducted in this paper use the CEMO corpus, which was collected in a French emergency call center. Two pre-trained models, one based on speech and one on text, were fine-tuned for speech emotion recognition. Using pre-trained Transformer encoders helps mitigate the limited and sparse nature of our data. This paper explores different fusion strategies for these modality-specific models. In particular, fusions with and without cross-attention mechanisms were tested to gather the most relevant information from both the speech and text encoders. We show that multimodal fusion brings an absolute gain of 4-9% with respect to either single modality and that the symmetric multi-headed cross-attention mechanism performed better than classical late fusion approaches. Our experiments also suggest that for the real-life CE
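To make the two fusion families named in the abstract concrete, here is a minimal NumPy sketch contrasting a symmetric multi-head cross-attention fusion with a simple late fusion. It is an illustration under stated assumptions, not the authors' architecture: the function names, the feature dimension, the number of heads, the mean-pooling step, and the probability-averaging form of late fusion are all hypothetical, and random matrices stand in for encoder outputs and learned projection weights.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, w_q, w_k, w_v, num_heads):
    """Multi-head attention in which `queries` attend to `context`.

    queries: (T_q, d), context: (T_c, d); returns (T_q, d).
    """
    t_q, d = queries.shape
    t_c = context.shape[0]
    d_head = d // num_heads
    # Project, then split the feature axis into heads: (heads, T, d_head).
    q = (queries @ w_q).reshape(t_q, num_heads, d_head).transpose(1, 0, 2)
    k = (context @ w_k).reshape(t_c, num_heads, d_head).transpose(1, 0, 2)
    v = (context @ w_v).reshape(t_c, num_heads, d_head).transpose(1, 0, 2)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, T_q, T_c)
    out = softmax(scores, axis=-1) @ v                   # (heads, T_q, d_head)
    # Merge the heads back into one feature axis.
    return out.transpose(1, 0, 2).reshape(t_q, d)

def symmetric_fusion(speech, text, params, num_heads=4):
    # "Symmetric" here means both directions are computed:
    # speech frames attend to text tokens, and text tokens attend to speech frames.
    s2t = cross_attention(speech, text, *params["speech_to_text"], num_heads)
    t2s = cross_attention(text, speech, *params["text_to_speech"], num_heads)
    # Assumed pooling: average over time, then concatenate both directions.
    return np.concatenate([s2t.mean(axis=0), t2s.mean(axis=0)])

def late_fusion(speech_logits, text_logits):
    # Assumed form of late fusion: average the two models' class probabilities.
    return 0.5 * (softmax(speech_logits) + softmax(text_logits))

# Placeholder inputs: 50 speech frames and 12 text tokens, both of width d.
rng = np.random.default_rng(0)
d = 16
speech = rng.normal(size=(50, d))
text = rng.normal(size=(12, d))
params = {
    name: [rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3)]
    for name in ("speech_to_text", "text_to_speech")
}
fused = symmetric_fusion(speech, text, params)  # vector of length 2 * d
```

In the cross-attention variant the two modalities interact before any decision is made, whereas late fusion only combines the outputs of two independently run classifiers. In a trained system the fused vector would feed a classification layer, which is omitted here.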
Related papers
- Is Cross-attention Preferable To Self-attention For Multi-modal Emotion Recognition? (2022) 3.64
- Cross-modal Fusion Techniques For Utterance-level Emotion Recognition From Text And Speech (2023) 9.59
- Multi-modal Emotion Recognition By Text, Speech And Video Using Pretrained Transformers (2024) 0.00
- Conversational Emotion Analysis Via Attention Mechanisms (2019) 10.35
- Leveraging Cross-attention Transformer And Multi-feature Fusion For Cross-linguistic Speech Emotion Recognition (2025) 4.52
- Audio-guided Fusion Techniques For Multimodal Emotion Analysis (2024) 4.52
- Fusion Approaches For Emotion Recognition From Speech Using Acoustic And Text-based Features (2024) 12.25
- Temporal Aggregation Of Audio-visual Modalities For Emotion Recognition (2020) 8.09