MIAR: Modality Interaction And Alignment Representation Fusion For Multimodal Emotion
2026 · Jichao Zhu, Jun Yu
Abstract
Multimodal Emotion Recognition (MER) aims to perceive human emotions through three modalities: language, vision, and audio. Previous methods focused primarily on modal fusion without adequately addressing the significant distributional differences among modalities or their varying contributions to the task. They also lacked robust generalization across diverse textual model features, which limits performance in multimodal scenarios. We therefore propose a novel approach called Modality Interaction and Alignment Representation (MIAR). The network integrates contextual features across modalities through a feature interaction step that generates feature tokens. These four tokens serve as global representations of how each modality extracts information from the others. MIAR aligns the different modalities using contrastive learning and normalization strategies. We conduct experimen
Related papers
- Enhancing Modal Fusion By Alignment And Label Matching For Multimodal Emotion Recognition (2024) 6.34
- Cross-modal Fusion Techniques For Utterance-level Emotion Recognition From Text And Speech (2023) 9.59
- Exploiting Modality-invariant Feature For Robust Multimodal Emotion Recognition With Missing Modalities (2022) 3.16
- Leveraging Label Potential For Enhanced Multimodal Emotion Recognition (2025) 0.00
- MMER: Multimodal Multi-task Learning For Speech Emotion Recognition (2022) 10.07
- Effmulti: Efficiently Modeling Complex Multimodal Interactions For Emotion Analysis (2022) 0.00
- Learning Alignment For Multimodal Emotion Recognition From Speech (2019) 15.22
- Agent-based Modular Learning For Multimodal Emotion Recognition In Human-agent Systems (2025) 0.00