Narrating The Video: Boosting Text-video Retrieval Via Comprehensive Utilization Of Frame-level Captions
2025 · Chan Hur, Jeong-Hun Hong, Dong-Hun Lee, et al.
Abstract
In recent text-video retrieval, additional captions generated by vision-language models have shown promising effects on performance. However, existing models that use such captions have often struggled to capture the rich semantics inherent in the video, including temporal changes. In addition, incorrect information produced by generative models can lead to inaccurate retrieval. To address these issues, we propose a new framework, Narrating the Video (NarVid), which strategically leverages the comprehensive information available from frame-level captions, the narration. The proposed NarVid exploits narration in multiple ways: 1) feature enhancement through cross-modal interactions between narration and video, 2) query-aware adaptive filtering to suppress irrelevant or incorrect information, 3) a dual-modal matching score obtained by adding query-video similarity and query-narration similarity, and 4) a hard-negative loss to learn discriminative features from multiple perspectives using the
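The dual-modal matching score in item 3 can be illustrated with a minimal sketch. It assumes that the query, the video, and the narration have each already been encoded into a single embedding vector, that similarity is cosine similarity, and that the two similarities are added with equal weight. The function names and the toy vectors are hypothetical and are not taken from the paper, which may pool features or weight the terms differently.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def dual_modal_score(query, video, narration):
    """Assumed form: query-video similarity plus query-narration similarity."""
    return cosine(query, video) + cosine(query, narration)

def rank(query, candidates):
    """Rank candidates, given as (id, video_emb, narration_emb), by descending score."""
    scored = [(cid, dual_modal_score(query, v, n)) for cid, v, n in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Toy 2-D embeddings, purely illustrative.
query = [1.0, 0.0]
candidates = [
    ("c", [0.0, 1.0], [0.0, 1.0]),  # neither modality matches the query
    ("a", [1.0, 0.0], [1.0, 0.0]),  # both modalities match
    ("b", [0.0, 1.0], [1.0, 0.0]),  # only the narration matches
]
ranking = rank(query, candidates)
```

In this toy setup a candidate whose narration matches the query still outranks one where neither modality matches, which is the intuition behind combining the two similarities.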
Related papers
- Cap4Video: What Can Auxiliary Captions Do For Text-video Retrieval? (2022) [20.22]
- Distilling Vision-language Models On Millions Of Videos (2024) [7.50]
- Bridging Information Asymmetry In Text-video Retrieval: A Data-centric Approach (2024) [0.00]
- DREAM: Improving Video-text Retrieval Through Relevance-based Augmentation Using Large Foundation Models (2024) [2.26]
- Fighting Fire With FIRE: Assessing The Validity Of Text-to-video Retrieval Benchmarks (2022) [0.00]
- Towards Holistic Language-video Representation: The Language Model-enhanced MSR-Video To Text Dataset (2024) [0.00]
- Learning Audio-guided Video Representation With Gated Attention For Video-text Retrieval (2025) [5.24]
- Learning Video Retrieval Models With Relevance-aware Online Mining (2022) [6.07]