Recursive Joint Attention For Audio-visual Fusion In Regression Based Emotion Recognition
2023 · R Gnana Praveen, Eric Granger, Patrick Cardinal
Abstract
In video-based emotion recognition (ER), it is important to effectively leverage the complementary relationship between the audio (A) and visual (V) modalities while retaining the intra-modal characteristics of each modality. In this paper, a recursive joint attention model is proposed along with long short-term memory (LSTM) modules for the fusion of vocal and facial expressions in regression-based ER. Specifically, we investigate exploiting the complementary nature of the A and V modalities by applying a joint cross-attention model recursively, with LSTMs capturing temporal dependencies both within each modality and across the A-V feature representations. By integrating LSTMs with recursive joint cross-attention, our model can efficiently leverage both intra- and inter-modal relationships for the fusion of A and V modalities. The results of extensive experiments performed on the challenging Affwild2 and Fatigue (private) datasets indicat
Related papers
- Recursive Joint Cross-modal Attention For Multimodal Fusion In Dimensional Emotion Recognition (2024) · 11.39
- A Joint Cross-attention Model For Audio-visual Fusion In Dimensional Emotion Recognition (2022) · 18.00
- Audio Visual Emotion Recognition With Temporal Alignment And Perception Attention (2016) · 0.00
- United We Stand, Divided We Fall: Handling Weak Complementary Relationships For Audio-visual Emotion Recognition In Valence-arousal Space (2025) · 4.52
- Temporal Aggregation Of Audio-visual Modalities For Emotion Recognition (2020) · 8.09
- Attentive Fusion Enhanced Audio-visual Encoding For Transformer Based Robust Speech Recognition (2020) · 0.00
- TAGF: Time-aware Gated Fusion For Multimodal Valence-arousal Estimation (2025) · 0.00
- Attentive Modality Hopping Mechanism For Speech Emotion Recognition (2019) · 0.00