Audio-guided Fusion Techniques For Multimodal Emotion Analysis
2024 · Pujin Shi, Fei Gao
Abstract
In this paper, we propose a solution for the semi-supervised learning track (MER-SEMI) of MER2024. First, to enhance the performance of the feature extractors on sentiment classification tasks, we fine-tuned the video and text feature extractors, specifically CLIP-vit-large and Baichuan-13B, using labeled data. This approach effectively preserves the original emotional information conveyed in the videos. Second, we propose an Audio-Guided Transformer (AGT) fusion mechanism, which leverages the robustness of Hubert-large and shows superior effectiveness in fusing both inter-channel and intra-channel information. Third, to enhance the accuracy of the model, we iteratively apply self-supervised learning, using high-confidence predictions on unlabeled data as pseudo-labels. Finally, through black-box probing, we discovered an imbalanced data distribution between the training and test sets. Therefore, we adopt a prior-knowledge-based voting mechanism. The results demonstrate the effectiveness of our s
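The pseudo-labeling step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `select_pseudo_labels`, the 0.9 confidence threshold, and the example probabilities are all assumptions made for illustration. The sketch only shows the selection rule, that is, keeping an unlabeled sample when the model's top class probability reaches a threshold and using the predicted class as its pseudo-label.

```python
from typing import List, Sequence, Tuple


def select_pseudo_labels(
    probs: Sequence[Sequence[float]],
    threshold: float = 0.9,  # assumed value; the abstract does not state one
) -> List[Tuple[int, int, float]]:
    """Return (sample_index, predicted_class, confidence) for confident samples."""
    selected = []
    for idx, p in enumerate(probs):
        confidence = max(p)               # top class probability
        label = list(p).index(confidence)  # index of the predicted class
        if confidence >= threshold:
            selected.append((idx, label, confidence))
    return selected


# Hypothetical class probabilities for four unlabeled clips over three classes.
unlabeled_probs = [
    [0.95, 0.03, 0.02],
    [0.40, 0.35, 0.25],
    [0.05, 0.92, 0.03],
    [0.10, 0.10, 0.80],
]

pseudo = select_pseudo_labels(unlabeled_probs, threshold=0.9)
print(pseudo)  # [(0, 0, 0.95), (2, 1, 0.92)]
```

In an iterative scheme, the selected samples would be added to the labeled pool, the model retrained, and the selection repeated on the remaining unlabeled data. How many rounds to run and how to set the threshold are choices the abstract leaves open.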
Related papers
- Mutilmodal Feature Extraction And Attention-based Fusion For Emotion Estimation In Videos (2023) 1.40
- Enhancing Modal Fusion By Alignment And Label Matching For Multimodal Emotion Recognition (2024) 6.34
- Leveraging Label Potential For Enhanced Multimodal Emotion Recognition (2025) 0.00
- Cross-modal Fusion Techniques For Utterance-level Emotion Recognition From Text And Speech (2023) 9.59
- Multistage Linguistic Conditioning Of Convolutional Layers For Speech Emotion Recognition (2021) 9.23
- Fusion Approaches For Emotion Recognition From Speech Using Acoustic And Text-based Features (2024) 12.25
- Interpretable Multimodal Emotion Recognition Using Hybrid Fusion Of Speech And Image Data (2022) 11.85
- Continuous Multimodal Emotion Recognition Approach For AVEC 2017 (2017) 0.00