SDIF-DA: A Shallow-to-deep Interaction Framework With Data Augmentation For Multi-modal Intent Detection
2023 · Shijue Huang, Libo Qin, Bingbing Wang, et al.
Abstract
Multi-modal intent detection aims to use multiple modalities to understand a user's intentions, which is essential for deploying dialogue systems in real-world scenarios. The two core challenges for multi-modal intent detection are (1) how to effectively align and fuse features from different modalities and (2) the limited amount of labeled multi-modal intent training data. In this work, we introduce a shallow-to-deep interaction framework with data augmentation (SDIF-DA) to address these challenges. First, SDIF-DA leverages a shallow-to-deep interaction module to progressively align and fuse features across the text, video, and audio modalities. Second, we propose a ChatGPT-based data augmentation approach to automatically generate additional training data. Experimental results demonstrate that SDIF-DA effectively aligns and fuses multi-modal features, achieving state-of-the-art performance. In addition, extensive analyses show that the introduced data augmenta
Related papers
- AGIF: An Adaptive Graph-interactive Framework For Joint Multiple Intent Detection And Slot Filling (2020) 14.11
- SpeechGPT: Empowering Large Language Models With Intrinsic Cross-modal Conversational Abilities (2023) 16.59
- Enhancing Multimodal Sentiment Analysis For Missing Modality Through Self-distillation And Unified Modality Cross-attention (2024) 6.71
- Data-centric Improvements For Enhancing Multi-modal Understanding In Spoken Conversation Modeling (2024) 0.00
- Addressing Gradient Misalignment In Data-augmented Training For Robust Speech Deepfake Detection (2025) 0.00
- Improving Multimodal Speech Recognition By Data Augmentation And Speech Representations (2022) 9.03
- Modality Dropout For Multimodal Device Directed Speech Detection Using Verbal And Non-verbal Features (2023) 0.00
- Enhancing Real-world Active Speaker Detection With Multi-modal Extraction Pre-training (2024) 5.24