Multimodal Fusion Method With Spatiotemporal Sequences And Relationship Learning For Valence-arousal Estimation
2024 · Jun Yu, Gongpeng Zhao, Yongqi Wang, et al.
Abstract
This paper presents our approach to the Valence-Arousal (VA) estimation task in the ABAW6 competition. Our method is a multimodal fusion model: video frames and audio segments are preprocessed, and pre-trained video and audio backbones extract visual and audio features from them. Temporal Convolutional Network (TCN) modules then encode these features, capturing the temporal and spatial correlations between them. A Transformer encoder follows to learn long-range temporal dependencies, which improves the model's performance and generalization ability. Experimental results on the AffWild2 dataset demonstrate the effectiveness of the approach, which achieves competitive performance in VA estimation.
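The TCN stage named in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the abstract does not state kernel sizes, dilation rates, whether the convolutions are causal, or how the two modalities are combined. The sketch assumes per-frame concatenation of the visual and audio features and a causal dilated 1D convolution. The names `fuse_by_concat`, `dilated_conv1d`, and `receptive_field` are hypothetical, and the Transformer encoder stage is omitted.

```python
from typing import List

# A feature sequence is indexed as [time][channel].
Sequence = List[List[float]]


def fuse_by_concat(visual: Sequence, audio: Sequence) -> Sequence:
    """Concatenate per-frame visual and audio features (assumed fusion scheme)."""
    if len(visual) != len(audio):
        raise ValueError("modalities must be aligned to the same number of frames")
    return [v + a for v, a in zip(visual, audio)]


def dilated_conv1d(
    x: Sequence, weights: List[List[List[float]]], dilation: int
) -> Sequence:
    """Causal dilated 1D convolution over time (assumed TCN building block).

    weights[k][c_in][c_out] is the weight of kernel tap k, which looks back
    k * dilation frames. Frames before the start of the sequence are treated
    as zeros, so the output has the same length as the input.
    """
    c_in = len(x[0])
    c_out = len(weights[0][0])
    out: Sequence = []
    for t in range(len(x)):
        frame = [0.0] * c_out
        for k, w_k in enumerate(weights):
            src = t - k * dilation
            if src < 0:
                continue  # implicit left zero-padding
            for i in range(c_in):
                for o in range(c_out):
                    frame[o] += x[src][i] * w_k[i][o]
        out.append(frame)
    return out


def receptive_field(kernel_size: int, dilations: List[int]) -> int:
    """Frames seen by one output step after stacking one conv per dilation."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


if __name__ == "__main__":
    # Toy data: 4 frames, 1 visual channel and 1 audio channel per frame.
    fused = fuse_by_concat([[1.0], [2.0], [3.0], [4.0]],
                           [[0.5], [0.5], [0.5], [0.5]])
    # Kernel size 2, 2 input channels, 1 output channel, all weights 1.0.
    w = [[[1.0], [1.0]], [[1.0], [1.0]]]
    print(dilated_conv1d(fused, w, dilation=2))
    print(receptive_field(kernel_size=3, dilations=[1, 2, 4]))
```

Stacking layers with growing dilation widens the temporal context quickly while keeping the layer count small. Dependencies beyond that window are what the abstract assigns to the Transformer encoder.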
Related papers
- Multi-modal Continuous Valence And Arousal Prediction In The Wild Using Deep 3D Features And Sequence Modeling (2020) 0.00
- SUN Team's Contribution To ABAW 2024 Competition: Audio-visual Valence-arousal Estimation And Expression Recognition (2024) 0.00
- TAGF: Time-aware Gated Fusion For Multimodal Valence-arousal Estimation (2025) 0.00
- A Joint Cross-attention Model For Audio-visual Fusion In Dimensional Emotion Recognition (2022) 18.00
- Self-relation Attention And Temporal Awareness For Emotion Recognition Via Vocal Burst (2022) 4.18
- Recursive Joint Attention For Audio-visual Fusion In Regression Based Emotion Recognition (2023) 9.59
- Temporal Aggregation Of Audio-visual Modalities For Emotion Recognition (2020) 8.09
- Mutilmodal Feature Extraction And Attention-based Fusion For Emotion Estimation In Videos (2023) 1.40