Composed Video Retrieval Via Enriched Context And Discriminative Embeddings
2024 · Omkar Thawakar, Muzammal Naseer, Rao Muhammad Anwer, et al.
Abstract
Composed video retrieval (CoVR) is a challenging problem in computer vision that has recently drawn attention to integrating modification text with visual queries for more sophisticated video search in large databases. Existing works predominantly rely on visual queries combined with modification text to distinguish relevant videos. However, such a strategy struggles to fully preserve the rich query-specific context in retrieved target videos and represents the target video using only a visual embedding. We introduce a novel CoVR framework that leverages detailed language descriptions to explicitly encode query-specific contextual information and learns discriminative vision-only, text-only, and vision-text embeddings for better alignment, so that matched target videos are retrieved accurately. Our proposed framework can be flexibly employed for both composed video retrieval (CoVR) and composed image retrieval (CoIR) tasks. Experiments on three datasets show that our approach obtains state-of-the-art performance f
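To make the idea of matching a composed query against several target embeddings concrete, here is a minimal sketch. It is not the authors' implementation. The abstract does not say how the vision-only, text-only, and vision-text similarities are combined, so the plain average used here is an assumption, and the names `cosine`, `rank_targets`, and the toy vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_targets(query, gallery):
    """Rank gallery items by similarity to a composed query embedding.

    query:   joint embedding of (query video + modification text).
    gallery: maps an item name to its three target embeddings
             under the keys 'vision', 'text' and 'vision_text'.
    The three similarities are averaged; this combination rule is an
    assumption made for illustration only.
    """
    keys = ("vision", "text", "vision_text")
    scores = {
        name: sum(cosine(query, emb[k]) for k in keys) / len(keys)
        for name, emb in gallery.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy 2-D embeddings, invented for this example.
query = [1.0, 0.0]
gallery = {
    "video_a": {"vision": [1.0, 0.0], "text": [0.9, 0.1], "vision_text": [1.0, 0.1]},
    "video_b": {"vision": [0.0, 1.0], "text": [0.1, 0.9], "vision_text": [0.0, 1.0]},
}
ranking = rank_targets(query, gallery)
print(ranking)
```

In this toy gallery, `video_a` points in nearly the same direction as the query in all three embedding spaces, so it ranks first. A trained system would obtain these embeddings from learned encoders rather than hand-written vectors.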