Beyond Simple Edits: Composed Video Retrieval With Dense Modifications
2025 · Omkar Thawakar, Dmitry Demidov, Ritesh Thawkar, et al.
Abstract
Composed video retrieval is a challenging task that aims to retrieve a target video given a query video and a textual description detailing specific modifications. Standard retrieval frameworks typically struggle with the complexity of fine-grained compositional queries and with variations in temporal understanding, which limits their retrieval ability in the fine-grained setting. To address this issue, we introduce a novel dataset that captures both fine-grained and composed actions across diverse video segments, enabling more detailed compositional changes in retrieved video content. The proposed dataset, named Dense-WebVid-CoVR, consists of 1.6 million samples with dense modification text that is around seven times more than its existing counterpart. We further develop a new model that integrates visual and textual information through Cross-Attention (CA) fusion using a grounded text encoder, enabling precise alignment between dense query modifications and target videos. The proposed
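As a rough illustration of the cross-attention fusion idea mentioned in the abstract, the sketch below lets modification-text tokens attend over query-video tokens, pools the result into one embedding, and ranks candidate targets by cosine similarity. It is a toy under stated assumptions, not the paper's model: there are no learned projections or grounded text encoder, and the names `cross_attention`, `mean_pool`, and `rank_targets` are hypothetical.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attention(queries, keys, values):
    # Scaled dot-product attention: each query token attends over all
    # key/value tokens. Assumption: no learned projection matrices.
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    fused = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        fused.append([sum(w * v[d] for w, v in zip(weights, values))
                      for d in range(dim)])
    return fused

def mean_pool(tokens):
    # Average token vectors into a single embedding.
    n = len(tokens)
    return [sum(t[d] for t in tokens) / n for d in range(len(tokens[0]))]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def rank_targets(text_tokens, video_tokens, target_embeddings):
    # Text tokens act as queries; video tokens act as keys and values.
    fused = mean_pool(cross_attention(text_tokens, video_tokens, video_tokens))
    scores = [cosine(fused, t) for t in target_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Toy inputs (made up for illustration): 2 text tokens, 2 video tokens, 2 targets.
text_tokens = [[1.0, 0.0], [0.0, 1.0]]
video_tokens = [[1.0, 0.0], [0.0, 1.0]]
targets = [[1.0, 1.0], [1.0, -1.0]]
ranking = rank_targets(text_tokens, video_tokens, targets)
print(ranking)
```

In this toy setup the fused query embedding points along `[1, 1]`, so the first candidate is ranked ahead of the second. A real system would replace the raw token vectors with encoder outputs and add trained projection layers.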
Related papers
- Composed Video Retrieval Via Enriched Context And Discriminative Embeddings (2024) 12.19
- CoVR-R: Reason-Aware Composed Video Retrieval (2026) 2.02
- From Play To Replay: Composed Video Retrieval For Temporally Fine-grained Videos (2025) 0.00
- ICSVR: Investigating Compositional And Syntactic Understanding In Video Retrieval Models (2023) 8.92
- EgoCVR: An Egocentric Benchmark For Fine-grained Composed Video Retrieval (2024) 10.00
- Video-adverb Retrieval With Compositional Adverb-action Embeddings (2023) 0.00
- PREGEN: Uncovering Latent Thoughts In Composed Video Retrieval (2026) 0.00
- RAVU: Retrieval Augmented Video Understanding With Compositional Reasoning Over Graph (2025) 0.00