DLF: Disentangled-Language-Focused Multimodal Sentiment Analysis
2024 · Pan Wang, Qiang Zhou, Yawen Wu, et al.
Abstract
Multimodal Sentiment Analysis (MSA) leverages heterogeneous modalities, such as language, vision, and audio, to enhance the understanding of human sentiment. While existing models often focus on extracting shared information across modalities or directly fusing heterogeneous modalities, such approaches can introduce redundancy and conflicts due to equal treatment of all modalities and the mutual transfer of information between modality pairs. To address these issues, we propose a Disentangled-Language-Focused (DLF) multimodal representation learning framework, which incorporates a feature disentanglement module to separate modality-shared and modality-specific information. To further reduce redundancy and enhance language-targeted features, four geometric measures are introduced to refine the disentanglement process. A Language-Focused Attractor (LFA) is further developed to strengthen language representation by leveraging complementary modality-specific information through a language-
Related papers
- PSA-MF: Personality-sentiment Aligned Multi-level Fusion For Multimodal Sentiment Analysis (2025) · 0.00
- Enhancing Multimodal Sentiment Analysis For Missing Modality Through Self-distillation And Unified Modality Cross-attention (2024) · 6.71
- On The Use Of Modality-specific Large-scale Pre-trained Encoders For Multimodal Sentiment Analysis (2022) · 6.77
- Enriching Multimodal Sentiment Analysis Through Textual Emotional Descriptions Of Visual-audio Content (2024) · 10.48
- Discrete Multimodal Transformers With A Pretrained Large Language Model For Mixed-supervision Speech Processing (2024) · 0.00
- MIAR: Modality Interaction And Alignment Representation Fuison For Multimodal Emotion (2026) · 0.00
- MSF-SER: Enriching Acoustic Modeling With Multi-granularity Semantics For Speech Emotion Recognition (2025) · 0.00
- CMSBERT-CLR: Context-driven Modality Shifting BERT With Contrastive Learning For Linguistic, Visual, Acoustic Representations (2022) · 4.52