Cross-modal Fusion Techniques For Utterance-level Emotion Recognition From Text And Speech
2023 · Jiachen Luo, Huy Phan, Joshua Reiss
Abstract
Multimodal emotion recognition (MER) is a fundamentally complex research problem due to the uncertainty of human emotional expression and the heterogeneity gap between different modalities. Audio and text modalities are particularly important for a human participant in understanding emotions. Although many successful attempts have been made to design multimodal representations for MER, multiple challenges remain to be addressed: 1) bridging the heterogeneity gap between multimodal features and modelling the inter- and intra-modal interactions of multiple modalities; 2) effectively and efficiently modelling the contextual dynamics in the conversation sequence. In this paper, we propose the Cross-Modal RoBERTa (CM-RoBERTa) model for emotion detection from spoken audio and corresponding transcripts. As the core unit of CM-RoBERTa, parallel self- and cross-attention is designed to dynamically capture inter- and intra-modal interactions of audio and text. Specifically, the mid-level fusion and res
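As a rough illustration of the parallel self- and cross-attention idea described in the abstract, the following is a minimal sketch in plain Python. It is not the CM-RoBERTa implementation. It assumes single-head scaled dot-product attention with no learned projections, and it assumes the two branches are combined by feature-wise concatenation, which the abstract does not specify. The names `attention` and `parallel_self_cross` are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention on lists of vectors."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return out


def parallel_self_cross(text, audio):
    """Run intra-modal (self) and inter-modal (cross) attention side by side.

    Assumption: the two branches are concatenated feature-wise; the
    abstract does not say how they are actually combined.
    """
    text_self = attention(text, text, text)      # text attends to text
    text_cross = attention(text, audio, audio)   # text attends to audio
    audio_self = attention(audio, audio, audio)  # audio attends to audio
    audio_cross = attention(audio, text, text)   # audio attends to text
    fused_text = [s + c for s, c in zip(text_self, text_cross)]
    fused_audio = [s + c for s, c in zip(audio_self, audio_cross)]
    return fused_text, fused_audio


# Toy inputs: 3 text tokens and 2 audio frames, both with feature size 2.
text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio = [[0.5, 0.5], [1.0, -1.0]]
fused_text, fused_audio = parallel_self_cross(text, audio)
print(len(fused_text), len(fused_text[0]))  # 3 4
```

Each modality keeps its own sequence length, while its feature size doubles because the self and cross outputs are concatenated. A real model would add learned query, key, and value projections and multiple heads.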