Fighting Fire With FIRE: Assessing The Validity Of Text-to-video Retrieval Benchmarks
2022 · Pedro Rodriguez, Mahmoud Azab, Becka Silvert, et al.
Abstract
Searching troves of videos with textual descriptions is a core multimodal retrieval task. Owing to the lack of a purpose-built dataset for text-to-video retrieval, video captioning datasets have been re-purposed to evaluate models by (1) treating captions as positive matches to their respective videos and (2) assuming all other videos to be negatives. However, this methodology leads to a fundamental flaw during evaluation: since captions are marked as relevant only to their original video, many alternate videos also match the caption, which introduces false-negative caption-video pairs. We show that when these false negatives are corrected, a recent state-of-the-art model gains 25% recall points -- a difference that threatens the validity of the benchmark itself. To diagnose and mitigate this issue, we annotate and release 683K additional caption-video pairs. Using these, we recompute effectiveness scores for three models on two standard benchmarks (MSR-VTT and MSVD). We find that (1)
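To make the evaluation issue concrete, here is a minimal sketch of Recall@k computed under the two labeling schemes the abstract contrasts: one where only a caption's source video counts as relevant, and one where additional matching videos are also marked relevant. The function `recall_at_k`, the toy similarity scores, and the extra annotation are all hypothetical illustrations, not the authors' code, data, or exact metric definition.

```python
def recall_at_k(scores, relevant, k):
    """Fraction of captions with at least one relevant video in the top k.

    scores:   list of rows, scores[c][v] = similarity of caption c to video v
    relevant: list of sets, relevant[c] = video ids judged to match caption c
    """
    hits = 0
    for caption_id, row in enumerate(scores):
        # Rank video ids from highest to lowest similarity for this caption.
        ranked = sorted(range(len(row)), key=lambda v: row[v], reverse=True)
        if any(v in relevant[caption_id] for v in ranked[:k]):
            hits += 1
    return hits / len(scores)


# Toy data (hypothetical): 3 captions, 4 videos; caption i was written for video i.
scores = [
    [0.9, 0.2, 0.1, 0.3],  # caption 0 ranks its own video 0 first
    [0.1, 0.4, 0.8, 0.2],  # caption 1 ranks video 2 above its own video 1
    [0.7, 0.1, 0.3, 0.2],  # caption 2 ranks video 0 above its own video 2
]

# Standard protocol: only the source video counts as a positive.
original_labels = [{0}, {1}, {2}]

# Assumed extra annotation: video 2 also matches caption 1,
# so the (caption 1, video 2) pair was a false negative.
corrected_labels = [{0}, {1, 2}, {2}]

r1_original = recall_at_k(scores, original_labels, k=1)
r1_corrected = recall_at_k(scores, corrected_labels, k=1)
print(f"R@1 original labels:  {r1_original:.3f}")
print(f"R@1 corrected labels: {r1_corrected:.3f}")
```

In this toy setup the model's rankings never change; only the relevance labels do. Caption 1's top-ranked video moves from being scored as a miss to being scored as a hit, which is the kind of shift the abstract attributes to correcting false negatives.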