Temporal Aggregation Of Audio-visual Modalities For Emotion Recognition
2020 · Andreea Birhala, Catalin Nicolae Ristea, Anamaria Radoi, et al.
Abstract
Emotion recognition plays a pivotal role in affective computing and human-computer interaction. Current technological developments have increased the possibilities for collecting data about a person's emotional state. In general, human perception of the emotion conveyed by a subject is based on vocal and visual information collected in the first seconds of interaction with the subject. As a consequence, integrating verbal (i.e., speech) and non-verbal (i.e., image) information is the preferred choice in most current approaches to emotion recognition. In this paper, we propose a multimodal fusion technique for emotion recognition that combines audio and visual modalities from a temporal window, using a different temporal offset for each modality. We show that the proposed method outperforms other methods from the literature as well as the human accuracy rating. The experiments are conducted on the open-access multimodal dataset CREMA-D.
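The idea of taking each modality's window at its own temporal offset can be illustrated with a minimal sketch. This is not the authors' architecture, which the abstract does not specify. It assumes that audio and video have already been converted to per-frame feature vectors at a common frame rate, and it uses mean pooling followed by concatenation as a stand-in for whatever fusion the paper actually applies. All names (`window_at_offset`, `mean_pool`, `fuse`) are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def window_at_offset(frames: Sequence[Vector], start: int,
                     offset: int, length: int) -> List[Vector]:
    """Return `length` consecutive frames beginning at `start + offset`.

    The start index is clamped so the window stays inside the clip.
    """
    if not frames:
        raise ValueError("frames must not be empty")
    begin = max(0, min(start + offset, len(frames) - length))
    return list(frames[begin:begin + length])


def mean_pool(frames: Sequence[Vector]) -> Vector:
    """Average frame-level feature vectors over time (assumed pooling)."""
    count = len(frames)
    return [sum(column) / count for column in zip(*frames)]


def fuse(audio: Sequence[Vector], video: Sequence[Vector], start: int,
         audio_offset: int, video_offset: int, length: int) -> Vector:
    """Pool each modality over its own offset window, then concatenate."""
    audio_vec = mean_pool(window_at_offset(audio, start, audio_offset, length))
    video_vec = mean_pool(window_at_offset(video, start, video_offset, length))
    return audio_vec + video_vec


# Toy features: 6 frames, 1 audio dimension and 2 video dimensions.
audio = [[float(i)] for i in range(6)]
video = [[10.0 * i, 1.0] for i in range(6)]

# Audio window covers frames 1-2; video window is shifted to frames 3-4.
fused = fuse(audio, video, start=1, audio_offset=0, video_offset=2, length=2)
print(fused)
```

In a real system the fused vector would feed a classifier over the emotion labels, and the per-modality offsets would be chosen or tuned on validation data. Both of those steps are outside the scope of this sketch.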
Related papers
- Continuous Multimodal Emotion Recognition Approach For AVEC 2017 (2017)
- A Joint Cross-attention Model For Audio-visual Fusion In Dimensional Emotion Recognition (2022)
- Interpretable Multimodal Emotion Recognition Using Hybrid Fusion Of Speech And Image Data (2022)
- Attentive Modality Hopping Mechanism For Speech Emotion Recognition (2019)
- Recursive Joint Attention For Audio-visual Fusion In Regression Based Emotion Recognition (2023)
- Audio-guided Fusion Techniques For Multimodal Emotion Analysis (2024)
- TAGF: Time-aware Gated Fusion For Multimodal Valence-arousal Estimation (2025)
- Emotion Recognition System From Speech And Visual Information Based On Convolutional Neural Networks (2020)