Enhancing Modal Fusion By Alignment And Label Matching For Multimodal Emotion Recognition
2024 · Qifei Li, Yingming Gao, Yuhua Wen, et al.
Abstract
To address the limitation in multimodal emotion recognition (MER) performance arising from inter-modal information fusion, we propose Foal-Net, a novel MER framework based on multitask learning in which fusion occurs after alignment. The framework is designed to make modality fusion more effective and includes two auxiliary tasks: audio-video emotion alignment (AVEL) and cross-modal emotion label matching (MEM). First, AVEL aligns the emotional information in the audio and video representations through contrastive learning. Then, a modal fusion network integrates the aligned features. Meanwhile, MEM assesses whether the emotions of the current sample pair are the same, which assists modal information fusion and guides the model to focus more on emotional information. Experimental results on the IEMOCAP corpus show that Foal-Net outperforms state-of-the-art methods and that emotion alignment is necessary before modal fusion.
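The two auxiliary objectives named in the abstract can be illustrated with a minimal sketch. The abstract does not give the loss formulas, so everything below is an assumption rather than the authors' implementation: the alignment step is sketched as a symmetric InfoNCE-style contrastive loss that treats the audio and video embeddings of the same sample as the positive pair, and the label-matching step is sketched only as the construction of binary "same emotion" targets for sample pairs. The function names, the temperature value, and the use of NumPy are all hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def contrastive_alignment_loss(audio, video, temperature=0.07):
    """Symmetric InfoNCE-style loss (assumed form, not taken from the paper).

    audio, video: arrays of shape (batch, dim); row i of each array is
    assumed to come from the same sample and is treated as the positive pair.
    """
    a = l2_normalize(audio)
    v = l2_normalize(video)
    logits = a @ v.T / temperature  # (batch, batch) scaled cosine similarities
    idx = np.arange(len(a))
    loss_a2v = -log_softmax(logits, axis=1)[idx, idx].mean()  # audio -> video
    loss_v2a = -log_softmax(logits, axis=0)[idx, idx].mean()  # video -> audio
    return 0.5 * (loss_a2v + loss_v2a)


def label_match_targets(labels):
    """Binary targets for label matching: entry (i, j) is 1 if samples i and j
    carry the same emotion label, else 0 (assumed pairing scheme)."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(np.float32)


# Toy check: perfectly aligned embeddings versus deliberately mismatched ones.
audio_emb = np.eye(4)
video_aligned = np.eye(4)
video_shuffled = np.roll(np.eye(4), 1, axis=0)

loss_aligned = contrastive_alignment_loss(audio_emb, video_aligned)
loss_shuffled = contrastive_alignment_loss(audio_emb, video_shuffled)
targets = label_match_targets([0, 1, 0])
```

In this toy setup the aligned pairs yield a much lower loss than the shuffled pairs, which is the behavior a contrastive alignment objective is meant to reward. In a multitask setting such terms would typically be added to the main emotion classification loss, but the weighting used in Foal-Net is not stated in the abstract.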
Related papers
- MIAR: Modality Interaction And Alignment Representation Fuison For Multimodal Emotion (2026) 0.00
- MMER: Multimodal Multi-task Learning For Speech Emotion Recognition (2022) 10.07
- Leveraging Label Potential For Enhanced Multimodal Emotion Recognition (2025) 0.00
- Audio-guided Fusion Techniques For Multimodal Emotion Analysis (2024) 4.52
- Cross-modal Fusion Techniques For Utterance-level Emotion Recognition From Text And Speech (2023) 9.59
- Group Gated Fusion On Attention-based Bidirectional Alignment For Multimodal Emotion Recognition (2022) 11.39
- Learning Alignment For Multimodal Emotion Recognition From Speech (2019) 15.22
- A Joint Cross-attention Model For Audio-visual Fusion In Dimensional Emotion Recognition (2022) 18.00