Bridging Information Asymmetry In Text-video Retrieval: A Data-centric Approach
2024 · Zechen Bai, Tianjun Xiao, Tong He, et al.
Abstract
As online video content rapidly grows, the task of text-video retrieval (TVR) becomes increasingly important. A key challenge in TVR is the information asymmetry between video and text: videos are inherently richer in information, while their textual descriptions often capture only fragments of this complexity. This paper introduces a novel, data-centric framework to bridge this gap by enriching textual representations to better match the richness of video content. During training, videos are segmented into event-level clips and captioned to ensure comprehensive coverage. During retrieval, a large language model (LLM) generates semantically diverse queries to capture a broader range of possible matches. To enhance retrieval efficiency, we propose a query selection mechanism that identifies the most relevant and diverse queries, reducing computational cost while improving accuracy. Our method achieves state-of-the-art results across multiple benchmarks, demonstrating the power of data-c
Related papers
- UATVR: Uncertainty-adaptive Text-video Retrieval (2023) 15.46
- Dual Encoding For Video Retrieval By Text (2020) 16.05
- Narrating The Video: Boosting Text-video Retrieval Via Comprehensive Utilization Of Frame-level Captions (2025) 6.77
- Use What You Have: Video Retrieval Using Representations From Collaborative Experts (2019) 0.00
- Tree-augmented Cross-modal Encoding For Complex-query Video Retrieval (2020) 15.57
- An Overview Of Challenges In Egocentric Text-video Retrieval (2023) 0.00
- Ambiguity-restrained Text-video Representation Learning For Partially Relevant Video Retrieval (2025) 5.84
- A Feature-space Multimodal Data Augmentation Technique For Text-video Retrieval (2022) 12.43