WDMIR: Wavelet-driven Multimodal Intent Recognition
2025 · Weiyin Gong, Kai Zhang, Yanghai Zhang, et al.
Abstract
Multimodal intent recognition (MIR) seeks to accurately interpret user intentions by integrating verbal and non-verbal information across video, audio, and text modalities. While existing approaches prioritize text analysis, they often overlook the rich semantic content embedded in non-verbal cues. This paper presents a novel Wavelet-Driven Multimodal Intent Recognition (WDMIR) framework that enhances intent understanding through frequency-domain analysis of non-verbal information. Specifically, we propose: (1) a wavelet-driven fusion module that performs synchronized decomposition and integration of video-audio features in the frequency domain, enabling fine-grained analysis of temporal dynamics; (2) a cross-modal interaction mechanism that facilitates progressive feature enhancement from bimodal to trimodal integration, effectively bridging the semantic gap between verbal and non-verbal information. Extensive experiments on MIntRec demonstrate that our approach achieves state-of
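The abstract does not spell out the fusion module, so the following is only a rough sketch of what "synchronized decomposition and integration of video-audio features in the frequency domain" could look like. It assumes a one-level Haar wavelet transform along the time axis, video and audio feature sequences of equal (even) length and equal dimension, and a fixed convex weighting per sub-band. The function names and the weights `w_low` and `w_high` are hypothetical and are not taken from the paper, which presumably learns its fusion rather than fixing it.

```python
import math

_S = 1.0 / math.sqrt(2.0)  # Haar normalisation factor


def haar_dwt(seq):
    """One-level Haar DWT along time for a list of feature vectors."""
    if len(seq) % 2 != 0:
        raise ValueError("sequence length must be even")
    low, high = [], []
    for i in range(0, len(seq), 2):
        # Low band: scaled pairwise sums; high band: scaled pairwise differences.
        low.append([(a + b) * _S for a, b in zip(seq[i], seq[i + 1])])
        high.append([(a - b) * _S for a, b in zip(seq[i], seq[i + 1])])
    return low, high


def haar_idwt(low, high):
    """Inverse of haar_dwt: rebuild the time sequence from both bands."""
    out = []
    for l, h in zip(low, high):
        out.append([(a + b) * _S for a, b in zip(l, h)])
        out.append([(a - b) * _S for a, b in zip(l, h)])
    return out


def fuse_band(v_band, a_band, w):
    """Hypothetical fusion rule: convex mix of video and audio coefficients."""
    return [[w * x + (1.0 - w) * y for x, y in zip(v, a)]
            for v, a in zip(v_band, a_band)]


def wavelet_fuse(video, audio, w_low=0.5, w_high=0.5):
    """Decompose both modalities with the same transform, fuse per band, invert."""
    v_low, v_high = haar_dwt(video)
    a_low, a_high = haar_dwt(audio)
    fused_low = fuse_band(v_low, a_low, w_low)
    fused_high = fuse_band(v_high, a_high, w_high)
    return haar_idwt(fused_low, fused_high)


if __name__ == "__main__":
    video = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
    audio = [[0.0, 0.0]] * 4
    # Keep only video's slow-varying (low-frequency) content.
    print(wavelet_fuse(video, audio, w_low=1.0, w_high=0.0))
```

Because both modalities pass through the same transform, their coefficients stay aligned in time and scale, which is one way to read "synchronized". Using separate weights per band lets slow trends and fast changes be mixed differently; a trainable model would replace the fixed weights with learned parameters.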
Related papers
- MIntRec: A New Dataset For Multimodal Intent Recognition (2022) 17.08
- WAVE: Learning Unified & Versatile Audio-visual Embeddings With Multimodal LLM (2025) 0.00
- MIAR: Modality Interaction And Alignment Representation Fusion For Multimodal Emotion (2026) 0.00
- Temporal Working Memory: Query-guided Segment Refinement For Enhanced Multimodal Understanding (2025) 11.33
- Integrating Audio, Visual, And Semantic Information For Enhanced Multimodal Speaker Diarization (2024) 0.00
- Robust Wake Word Spotting With Frame-level Cross-modal Attention Based Audio-visual Conformer (2024) 5.24
- Token-level Contrastive Learning With Modality-aware Prompting For Multimodal Intent Recognition (2023) 14.17
- Semantic Matters: Multimodal Features For Affective Analysis (2025) 0.00