Learning Audio-guided Video Representation With Gated Attention For Video-text Retrieval
2025 · Boseung Jeong, Jicheol Park, Sungyeon Kim, et al.
Abstract
Video-text retrieval, the task of retrieving videos based on a textual query or vice versa, is of paramount importance for video understanding and multimodal information retrieval. Recent methods in this area rely primarily on visual and textual features and often ignore audio, although it helps enhance overall comprehension of video content. Moreover, traditional models that incorporate audio blindly utilize the audio input regardless of whether it is useful or not, resulting in suboptimal video representation. To address these limitations, we propose a novel video-text retrieval framework, Audio-guided VIdeo representation learning with GATEd attention (AVIGATE), that effectively leverages audio cues through a gated attention mechanism that selectively filters out uninformative audio signals. In addition, we propose an adaptive margin-based contrastive loss to deal with the inherently unclear positive-negative relationship between video and text, which facilitates learning better vid
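The abstract describes the gating idea only at a high level, so the following is a minimal sketch of one way such a gate could work, not the authors' implementation. It assumes single-head cross-attention from video frames to audio tokens and a scalar sigmoid gate computed from mean-pooled features; the function name `gated_audio_fusion` and the parameters `gate_w` and `gate_b` are hypothetical and do not come from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_audio_fusion(video, audio, gate_w, gate_b):
    """Fuse audio into frame features through a scalar gate (illustrative only).

    video:  (T, d) frame embeddings
    audio:  (S, d) audio token embeddings
    gate_w: (2d,)  hypothetical gate weights
    gate_b: float  hypothetical gate bias
    """
    d = video.shape[-1]

    # Cross-attention: each frame attends over the audio tokens.
    attn = softmax(video @ audio.T / np.sqrt(d), axis=-1)  # (T, S)
    audio_ctx = attn @ audio                                # (T, d)

    # Assumed gate design: one scalar in (0, 1) from pooled video and audio.
    pooled = np.concatenate([video.mean(axis=0), audio.mean(axis=0)])  # (2d,)
    gate = sigmoid(pooled @ gate_w + gate_b)

    # A gate near 0 leaves the video features untouched, so uninformative
    # audio contributes almost nothing; a gate near 1 adds the full context.
    return video + gate * audio_ctx, gate
```

In this sketch, a strongly negative `gate_b` drives the gate toward 0 and the output reduces to the visual features alone, which is the behavior the abstract attributes to filtering out uninformative audio. In a trained model the gate parameters would be learned rather than set by hand.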
Related papers
- UATVR: Uncertainty-adaptive Text-video Retrieval (2023) 15.46
- Reading-strategy Inspired Visual Representation Learning For Text-to-video Retrieval (2022) 13.93
- VRAG: Region Attention Graphs For Content-based Video Retrieval (2022) 0.00
- Narrating The Video: Boosting Text-video Retrieval Via Comprehensive Utilization Of Frame-level Captions (2025) 6.77
- Bridging Information Asymmetry In Text-video Retrieval: A Data-centric Approach (2024) 0.00
- Audio Does Matter: Importance-aware Multi-granularity Fusion For Video Moment Retrieval (2025) 4.49
- X-CLIP: End-to-end Multi-grained Contrastive Learning For Video-text Retrieval (2022) 18.12
- Audio-enhanced Text-to-video Retrieval Using Text-conditioned Feature Alignment (2023) 11.08