A Joint Cross-attention Model For Audio-visual Fusion In Dimensional Emotion Recognition
2022 · R. Gnana Praveen, Wheidima Carneiro de Melo, Nasib Ullah, et al.
Abstract
Multimodal emotion recognition has recently gained much attention since it can leverage diverse and complementary relationships over multiple modalities (e.g., audio, visual, biosignals, etc.), and can provide some robustness to noisy modalities. Most state-of-the-art methods for audio-visual (A-V) fusion rely on recurrent networks or conventional attention mechanisms that do not effectively leverage the complementary nature of A-V modalities. In this paper, we focus on dimensional emotion recognition based on the fusion of facial and vocal modalities extracted from videos. Specifically, we propose a joint cross-attention model that relies on the complementary relationships to extract the salient features across A-V modalities, allowing for accurate prediction of continuous values of valence and arousal. The proposed fusion model efficiently leverages the inter-modal relationships, while reducing the heterogeneity between the features. In particular, it computes the cross-attention wei
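The abstract is cut off before it describes how the cross-attention weights are computed, so the paper's joint formulation cannot be reproduced from this text. As a rough illustration of the general idea only, the sketch below implements plain scaled dot-product cross-attention between two modalities. This is a swapped-in, generic technique, not the authors' model. The function names (`softmax`, `cross_attend`), the toy feature values, and the concatenation step are all assumptions made for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Generic cross-attention (hypothetical helper, not the paper's method).

    Each vector in `queries` (one modality) attends over all vectors in
    `keys_values` (the other modality). Returns the attended features and
    the attention weights.
    """
    d = len(queries[0])
    out_dim = len(keys_values[0])
    attended, weights = [], []
    for q in queries:
        # Scaled dot-product score between this query and every key.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in keys_values
        ]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the other modality's feature vectors.
        attended.append(
            [sum(wj * kv[c] for wj, kv in zip(w, keys_values))
             for c in range(out_dim)]
        )
    return attended, weights


# Toy, made-up features: 3 time steps, 2 dimensions per modality.
audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
visual = [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0]]

# Audio attends to visual, and visual attends to audio.
audio_att, w_av = cross_attend(audio, visual)
visual_att, w_va = cross_attend(visual, audio)

# Assumed fusion step: concatenate the two attended vectors per time step.
fused = [a + v for a, v in zip(audio_att, visual_att)]
```

In a valence/arousal setting, a vector like each entry of `fused` would typically be passed to a regression head. That step is omitted here because the abstract does not specify it.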
Related papers
- Recursive Joint Cross-modal Attention For Multimodal Fusion In Dimensional Emotion Recognition (2024) · 11.39
- Recursive Joint Attention For Audio-visual Fusion In Regression Based Emotion Recognition (2023) · 9.59
- Is Cross-attention Preferable To Self-attention For Multi-modal Emotion Recognition? (2022) · 3.64
- Continuous Multimodal Emotion Recognition Approach For AVEC 2017 (2017) · 0.00
- Multimodal Fusion With Deep Neural Networks For Audio-video Emotion Recognition (2019) · 0.00
- Temporal Aggregation Of Audio-visual Modalities For Emotion Recognition (2020) · 8.09
- Cross-modal Fusion Techniques For Utterance-level Emotion Recognition From Text And Speech (2023) · 9.59
- Mutilmodal Feature Extraction And Attention-based Fusion For Emotion Estimation In Videos (2023) · 1.40