Multimodal Large Language Models For End-to-end Affective Computing: Benchmarking And Boosting With Generative Knowledge Prompting
2025 · Miaosen Luo, Jiesen Long, Zequn Li, et al.
Abstract
Multimodal Affective Computing (MAC) aims to recognize and interpret human emotions by integrating information from diverse modalities such as text, video, and audio. Recent advancements in Multimodal Large Language Models (MLLMs) have significantly reshaped the landscape of MAC by offering a unified framework for processing and aligning cross-modal information. However, practical challenges remain, including performance variability across complex MAC tasks and insufficient understanding of how architectural designs and data characteristics impact affective analysis. To address these gaps, we conduct a systematic benchmark evaluation of state-of-the-art open-source MLLMs capable of concurrently processing audio, visual, and textual modalities across multiple established MAC datasets. Our evaluation not only compares the performance of these MLLMs but also provides actionable insights into model optimization by analyzing the influence of model architectures and dataset properties. Furth
Related papers
- LLMs Meet Multimodal Generation And Editing: A Survey (2024) 5.48
- Multimodal Large Language Models: A Survey (2023) 0.00
- Macaw-LLM: Multi-modal Language Modeling With Image, Audio, Video, And Text Integration (2023) 0.00
- A Review Of Multi-modal Large Language And Vision Models (2024) 0.00
- Benchmarking Large Multimodal Models Against Common Corruptions (2024) 2.89
- MCIF: Multimodal Crosslingual Instruction-following Benchmark From Scientific Talks (2025) 0.00
- Measuring Audio's Impact On Correctness: Audio-contribution-aware Post-training Of Large Audio Language Models (2025) 0.00
- C3LLM: Conditional Multimodal Content Generation Using Large Language Models (2024) 0.00