Reversed In Time: A Novel Temporal-emphasized Benchmark For Cross-modal Video-text Retrieval
2024 · Yang Du, Yuqi Liu, Qin Jin
Abstract
Cross-modal (e.g., image-text, video-text) retrieval is an important task in information retrieval and multimodal vision-language understanding. Temporal understanding makes video-text retrieval more challenging than image-text retrieval. However, we find that the widely used video-text benchmarks fall short in comprehensively assessing model abilities, especially temporal understanding, so that large-scale image-text pre-trained models can already achieve zero-shot performance comparable to that of video-text pre-trained models. In this paper, we introduce RTime, a novel temporal-emphasized video-text retrieval dataset. We first obtain videos of actions or events with significant temporality, and then reverse these videos to create harder negative samples. We then recruit annotators to judge the significance and reversibility of candidate videos, and to write captions for qualified videos. We further adopt GPT-4 to generate additional captions based on the human-written captions. Our RT
Related papers
- Dissecting Temporal Understanding In Text-to-audio Retrieval (2024)
- Fighting Fire With FIRE: Assessing The Validity Of Text-to-video Retrieval Benchmarks (2022)
- Tempme: Video Temporal Token Merging For Efficient Text-video Retrieval (2024)
- Towards Efficient And Robust Moment Retrieval System: A Unified Framework For Multi-granularity Models And Temporal Reranking (2025)
- Lovr: A Benchmark For Long Video Retrieval In Multimodal Contexts (2025)
- Dual-modal Attention-enhanced Text-video Retrieval With Triplet Partial Margin Contrastive Learning (2023)
- Temporal Perceiving Video-language Pre-training (2023)
- Madtempo: An Interactive System For Multi-event Temporal Video Retrieval With Query Augmentation (2025)