Abstract

Cross-modal (e.g. image-text, video-text) retrieval is an important task in information retrieval and multimodal vision-language understanding field. Temporal understanding makes video-text retrieval more challenging than image-text retrieval. However, we find that the widely used video-text benchmarks have shortcomings in comprehensively assessing abilities of models, especially in temporal understanding, causing large-scale image-text pre-trained models can already achieve comparable zero-shot performance with video-text pre-trained models. In this paper, we introduce RTime, a novel temporal-emphasized video-text retrieval dataset. We first obtain videos of actions or events with significant temporality, and then reverse these videos to create harder negative samples. We then recruit annotators to judge the significance and reversibility of candidate videos, and write captions for qualified videos. We further adopt GPT-4 to extend more captions based on human-written captions. Our RT

Authors

(none)

Tags

  • Cross-Modal Hashing
  • Image Retrieval

Stats

  • citations3
  • S2 citationsβ€”
  • github stars9
  • HF likes0
  • heat score6.52
  • arxiv keydu2024reversed

Related papers

Reversed In Time: A Novel Temporal-emphasized Benchmark For Cross-modal Video-text Retrieval β€” learning-to-hash