Agent-based Modular Learning For Multimodal Emotion Recognition In Human-agent Systems
2025 · Matvey Nepomnyaschiy, Oleg Pereziabov, Anvar Tliamov, et al.
Abstract
Effective human-agent interaction (HAI) relies on accurate and adaptive perception of human emotional states. While multimodal deep learning models - leveraging facial expressions, speech, and textual cues - offer high accuracy in emotion recognition, their training and maintenance are often computationally intensive and inflexible to modality changes. In this work, we propose a novel multi-agent framework for training multimodal emotion recognition systems, where each modality encoder and the fusion classifier operate as autonomous agents coordinated by a central supervisor. This architecture enables modular integration of new modalities (e.g., audio features via emotion2vec), seamless replacement of outdated components, and reduced computational overhead during training. We demonstrate the feasibility of our approach through a proof-of-concept implementation supporting vision, audio, and text modalities, with the classifier serving as a shared decision-making agent. Our framework not
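The coordination pattern described in the abstract (independent modality encoders, one shared classifier, a central supervisor) can be illustrated with a minimal sketch. Everything below is hypothetical and is not the authors' implementation: the class names `ModalityAgent`, `FusionClassifierAgent`, and `Supervisor`, the toy scoring rule, and the lambda encoders are all assumptions chosen to keep the example self-contained. It covers only how a supervisor might route inputs and swap components at inference time, not the training procedure the paper proposes.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence

Vector = List[float]


@dataclass
class ModalityAgent:
    """Hypothetical encoder agent: maps one modality's raw input to features."""
    name: str
    encode: Callable[[object], Vector]


@dataclass
class FusionClassifierAgent:
    """Hypothetical shared classifier agent operating on fused features."""
    labels: Sequence[str]

    def classify(self, features: Dict[str, Vector]) -> str:
        # Concatenate in sorted modality order so the result does not
        # depend on the order in which agents were registered.
        fused = [x for name in sorted(features) for x in features[name]]
        # Toy scoring rule (an assumption, not a trained model): label i
        # receives the sum of every len(labels)-th fused value from index i.
        n = len(self.labels)
        scores = [sum(fused[i::n]) for i in range(n)]
        return self.labels[scores.index(max(scores))]


@dataclass
class Supervisor:
    """Hypothetical coordinator that owns the agents and routes inputs."""
    classifier: FusionClassifierAgent
    agents: Dict[str, ModalityAgent] = field(default_factory=dict)

    def register(self, agent: ModalityAgent) -> None:
        # Registering under an existing name replaces the old component.
        self.agents[agent.name] = agent

    def remove(self, name: str) -> None:
        self.agents.pop(name, None)

    def predict(self, sample: Dict[str, object]) -> str:
        # Only modalities that are both registered and present are encoded.
        features = {
            name: agent.encode(sample[name])
            for name, agent in self.agents.items()
            if name in sample
        }
        if not features:
            raise ValueError("no registered modality present in sample")
        return self.classifier.classify(features)


supervisor = Supervisor(FusionClassifierAgent(labels=["neutral", "happy"]))
supervisor.register(
    ModalityAgent("text", lambda s: [0.0, 1.0] if "great" in s else [1.0, 0.0])
)
supervisor.register(ModalityAgent("audio", lambda a: [0.2, 0.8]))
prediction = supervisor.predict({"text": "this is great", "audio": b""})
```

In this sketch, adding a modality is a single `register` call and replacing an outdated encoder is a second `register` under the same name; the classifier and the other agents are left untouched.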
Related papers
- Effmulti: Efficiently Modeling Complex Multimodal Interactions For Emotion Analysis (2022)
- Learning Alignment For Multimodal Emotion Recognition From Speech (2019)
- Multimodal Emotion Recognition Using Transfer Learning From Speaker Recognition And Bert-based Models (2022)
- Audio-guided Fusion Techniques For Multimodal Emotion Analysis (2024)
- MIAR: Modality Interaction And Alignment Representation Fuison For Multimodal Emotion (2026)
- ML-SAN: Multi-level Speaker-adaptive Network For Emotion Recognition In Conversations (2026)
- Multimodal Emotion Recognition And Sentiment Analysis In Multi-party Conversation Contexts (2025)
- Multimodal Speech Emotion Recognition And Ambiguity Resolution (2019)