MAVEN: Multi-modal Attention For Valence-arousal Emotion Network
2025 · Vrushank Ahire, Kunal Shah, Mudasir Nazir Khan, et al.
Abstract
Dynamic emotion recognition in the wild remains challenging because emotional expressions are transient and multi-modal cues are temporally misaligned. Traditional approaches predict valence and arousal but often overlook the inherent correlation between these two dimensions. The proposed Multi-modal Attention for Valence-Arousal Emotion Network (MAVEN) integrates visual, audio, and textual modalities through a bi-directional cross-modal attention mechanism. MAVEN uses modality-specific encoders to extract features from synchronized video frames, audio segments, and transcripts, and predicts emotions in polar coordinates following Russell's circumplex model. Evaluated on the Aff-Wild2 dataset, MAVEN achieved a concordance correlation coefficient (CCC) of 0.3061, surpassing the ResNet-50 baseline (CCC of 0.22). The multistage architecture captures the subtle and transient nature of emotional expressions in conversational videos and improves emotion recognition.
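The two quantitative ideas in the abstract, polar-coordinate emotion targets and the CCC metric, can be illustrated with a minimal sketch. The abstract does not give MAVEN's exact parameterization, so the mapping below (radius as intensity, angle from `atan2(arousal, valence)`) is an assumption, and the helper names `to_polar`, `from_polar`, and `ccc` are hypothetical. The CCC function follows the standard definition of Lin's concordance correlation coefficient using population statistics.

```python
import math


def to_polar(valence, arousal):
    """Map a (valence, arousal) point to (intensity, angle in radians).

    Assumed convention: valence on the x-axis, arousal on the y-axis,
    as in the usual drawing of Russell's circumplex model.
    """
    return math.hypot(valence, arousal), math.atan2(arousal, valence)


def from_polar(intensity, angle):
    """Inverse mapping back to (valence, arousal)."""
    return intensity * math.cos(angle), intensity * math.sin(angle)


def ccc(pred, gold):
    """Concordance correlation coefficient between two equal-length sequences.

    Undefined (division by zero) when both sequences are constant and equal.
    """
    n = len(pred)
    mean_p = sum(pred) / n
    mean_g = sum(gold) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_g = sum((g - mean_g) ** 2 for g in gold) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gold)) / n
    return 2 * cov / (var_p + var_g + (mean_p - mean_g) ** 2)
```

Unlike plain Pearson correlation, CCC also penalizes differences in mean and scale between predictions and annotations, which is why perfectly correlated but offset predictions score below 1.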
Related papers
- Continuous Multimodal Emotion Recognition Approach For AVEC 2017 (2017)
- SUN Team's Contribution To ABAW 2024 Competition: Audio-visual Valence-arousal Estimation And Expression Recognition (2024)
- Multi-modal Continuous Valence And Arousal Prediction In The Wild Using Deep 3D Features And Sequence Modeling (2020)
- MMVA: Multimodal Matching Based On Valence And Arousal Across Images, Music, And Musical Captions (2025)
- MIAR: Modality Interaction And Alignment Representation Fuison For Multimodal Emotion (2026)
- Multimodal Fusion Method With Spatiotemporal Sequences And Relationship Learning For Valence-arousal Estimation (2024)
- Recursive Joint Attention For Audio-visual Fusion In Regression Based Emotion Recognition (2023)
- Framewise Approach In Multimodal Emotion Recognition In OMG Challenge (2018)