Recursive Joint Cross-modal Attention For Multimodal Fusion In Dimensional Emotion Recognition
2024 · R. Gnana Praveen, Jahangir Alam
Abstract
Though multimodal emotion recognition has achieved significant progress over recent years, the potential of rich synergic relationships across the modalities is not fully exploited. In this paper, we introduce Recursive Joint Cross-Modal Attention (RJCMA) to effectively capture both intra- and inter-modal relationships across audio, visual, and text modalities for dimensional emotion recognition. In particular, we compute the attention weights based on cross-correlation between the joint audio-visual-text feature representations and the feature representations of individual modalities to simultaneously capture intra- and inter-modal relationships across the modalities. The attended features of the individual modalities are again fed as input to the fusion model in a recursive mechanism to obtain more refined feature representations. We have also explored Temporal Convolutional Networks (TCNs) to improve the temporal modeling of the feature representations of individual modalities. Exten
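As an illustration of the mechanism the abstract describes, here is a minimal NumPy sketch of one joint cross-modal attention step applied recursively. It is not the authors' implementation. The function names (`fusion_step`, `recursive_fusion`), the projection matrices in `proj`, the `tanh` scaling, the softmax normalisation, the residual update, and all feature dimensions are assumptions chosen for illustration; the abstract only states that attention weights come from the cross-correlation between the joint representation and each modality, and that the attended features are fed back into the fusion model.

```python
import numpy as np


def softmax(z, axis):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def fusion_step(feats, proj):
    """One joint cross-modal attention step (illustrative, not the paper's exact form).

    feats: dict mapping modality name -> array of shape (d_m, L)
    proj:  dict mapping modality name -> hypothetical weight matrix of shape (d_m, d),
           where d is the sum of all d_m
    """
    names = sorted(feats)
    # Joint representation: concatenate all modalities along the feature axis.
    joint = np.concatenate([feats[m] for m in names], axis=0)  # (d, L)
    d = joint.shape[0]
    attended = {}
    for m in names:
        x = feats[m]  # (d_m, L)
        # Cross-correlation between this modality and the joint representation.
        corr = np.tanh(x.T @ proj[m] @ joint / np.sqrt(d))  # (L, L)
        # Assumption: turn correlations into weights with a column-wise softmax.
        attn = softmax(corr, axis=0)
        # Assumption: residual update so the feature shape is preserved.
        attended[m] = x + x @ attn
    return attended


def recursive_fusion(feats, proj, n_iters=2):
    """Feed the attended features back into the fusion step n_iters times."""
    for _ in range(n_iters):
        feats = fusion_step(feats, proj)
    return feats


# Toy inputs: feature sizes and sequence length are arbitrary.
rng = np.random.default_rng(0)
seq_len = 8
dims = {"audio": 16, "text": 12, "visual": 10}
feats = {m: rng.standard_normal((dm, seq_len)) for m, dm in dims.items()}
d_joint = sum(dims.values())
proj = {m: 0.1 * rng.standard_normal((dm, d_joint)) for m, dm in dims.items()}

fused = recursive_fusion(feats, proj, n_iters=2)
```

Because each step keeps every modality at its original shape, the output of one pass can be used directly as the input to the next, which is what makes the recursion possible in this sketch. In a trained model the matrices in `proj` would be learned rather than random.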
Related papers
- A Joint Cross-attention Model For Audio-visual Fusion In Dimensional Emotion Recognition (2022) 18.00
- Recursive Joint Attention For Audio-visual Fusion In Regression Based Emotion Recognition (2023) 9.59
- United We Stand, Divided We Fall: Handling Weak Complementary Relationships For Audio-visual Emotion Recognition In Valence-arousal Space (2025) 4.52
- Cross-modal Fusion Techniques For Utterance-level Emotion Recognition From Text And Speech (2023) 9.59
- Is Cross-attention Preferable To Self-attention For Multi-modal Emotion Recognition? (2022) 3.64
- Temporal Aggregation Of Audio-visual Modalities For Emotion Recognition (2020) 8.09
- Multimodal Fusion With Deep Neural Networks For Audio-video Emotion Recognition (2019) 0.00
- Multimodal Speech Emotion Recognition Using Audio And Text (2018) 18.02