From Play To Replay: Composed Video Retrieval For Temporally Fine-grained Videos
2025 · Animesh Gupta, Jay Parmar, Ishan Rajendrakumar Dave, et al.
Abstract
Composed Video Retrieval (CoVR) retrieves a target video given a query video and a modification text describing the intended change. Existing CoVR benchmarks emphasize appearance shifts or coarse event changes and therefore do not test the ability to capture subtle, fast-paced temporal differences. We introduce TF-CoVR, the first large-scale benchmark dedicated to temporally fine-grained CoVR. TF-CoVR focuses on gymnastics and diving, and provides 180K triplets drawn from the FineGym and FineDiving datasets. Previous CoVR benchmarks that focus on the temporal aspect link each query to a single target segment taken from the same video, which limits their practical usefulness. In TF-CoVR, we instead construct each <query, modification> pair by prompting an LLM with the label differences between clips drawn from different videos; every pair is thus associated with multiple valid target videos (3.9 on average), reflecting real-world tasks such as sports-highlight generation. To model these temporal dynamics
Related papers
- CoVR-R: Reason-Aware Composed Video Retrieval (2026) 2.02
- Composed Video Retrieval Via Enriched Context And Discriminative Embeddings (2024) 12.19
- EgoCVR: An Egocentric Benchmark For Fine-grained Composed Video Retrieval (2024) 10.00
- PREGEN: Uncovering Latent Thoughts In Composed Video Retrieval (2026) 0.00
- Beyond Simple Edits: Composed Video Retrieval With Dense Modifications (2025) 2.16
- LoVR: A Benchmark For Long Video Retrieval In Multimodal Contexts (2025) 0.00
- X-Aligner: Composed Visual Retrieval Without The Bells And Whistles (2026) 0.00
- Video-ColBERT: Contextualized Late Interaction For Text-to-video Retrieval (2025) 5.24