CoVR-R: Reason-Aware Composed Video Retrieval
2026 · Omkar Thawakar, Dmitry Demidov, Vaishnav Potlapalli, et al.
Abstract
Composed Video Retrieval (CoVR) aims to find a target video given a reference video and a textual modification. Prior work assumes the modification text fully specifies the visual changes, overlooking after-effects and implicit consequences (e.g., motion, state transitions, viewpoint or duration cues) that emerge from the edit. We argue that successful CoVR requires reasoning about these after-effects. We introduce a reasoning-first, zero-shot approach that leverages large multimodal models to (i) infer causal and temporal consequences implied by the edit, and (ii) align the resulting reasoned queries to candidate videos without task-specific finetuning. To evaluate reasoning in CoVR, we also propose CoVR-Reason, a benchmark that pairs each (reference, edit, target) triplet with structured internal reasoning traces and challenging distractors that require predicting after-effects rather than keyword matching. Experiments show that our zero-shot method outperforms strong retrieval basel
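The two steps named in the abstract, inferring after-effects and then matching the reasoned query against candidates, can be illustrated with a toy sketch. This is not the authors' implementation: the bag-of-words `embed` function stands in for a real multimodal encoder, the after-effects are supplied by hand instead of being generated by a large multimodal model, and all function names, captions, and video IDs are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding (assumption: stands in for a real encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def reasoned_query(reference_caption, edit_text, after_effects):
    """Combine the reference description, the edit, and inferred after-effects.

    In a real system the after-effects would come from a multimodal model;
    here they are passed in by hand (hypothetical placeholder).
    """
    return " ".join([reference_caption, edit_text, *after_effects])


def rank_candidates(query, candidates):
    """Return candidate IDs sorted by similarity to the query, best first."""
    q = embed(query)
    scored = [(cosine(q, embed(desc)), vid) for vid, desc in candidates.items()]
    return [vid for _, vid in sorted(scored, key=lambda s: s[0], reverse=True)]


# Hypothetical candidate pool, described by captions for simplicity.
candidates = {
    "v1": "a man pours water into a glass and the glass is full",
    "v2": "a man holds an empty glass",
    "v3": "a dog runs in a park",
}
reference = "a man holds an empty glass"
edit = "pour water into the glass"

# Query built from the literal edit only, with no after-effect reasoning.
plain_ranking = rank_candidates(reasoned_query(reference, edit, []), candidates)

# Query that also states the implied consequence of the edit.
reasoned_ranking = rank_candidates(
    reasoned_query(reference, edit, ["the glass is full"]), candidates
)
```

In this toy pool, the literal query ranks the unchanged reference-like clip `v2` first, while adding the implied consequence moves `v1` to the top. That is only meant to show why after-effects can matter for ranking; it says nothing about the paper's reported results.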
Related papers
- Composed Video Retrieval Via Enriched Context And Discriminative Embeddings (2024) · 12.19
- From Play To Replay: Composed Video Retrieval For Temporally Fine-grained Videos (2025) · 0.00
- PREGEN: Uncovering Latent Thoughts In Composed Video Retrieval (2026) · 0.00
- Beyond Simple Edits: Composed Video Retrieval With Dense Modifications (2025) · 2.16
- X-aligner: Composed Visual Retrieval Without The Bells And Whistles (2026) · 0.00
- EgoCVR: An Egocentric Benchmark For Fine-grained Composed Video Retrieval (2024) · 10.00
- ICSVR: Investigating Compositional And Syntactic Understanding In Video Retrieval Models (2023) · 8.92
- CIR-CoT: Towards Interpretable Composed Image Retrieval Via End-to-end Chain-of-thought Reasoning (2025) · 0.00