Image Retrieval From Contextual Descriptions
2022 · Benno Krojer, Vaibhav Adlakha, Vibhav Vineet, et al.
Abstract
The ability to integrate context, including perceptual and temporal cues, plays a pivotal role in grounding the meaning of a linguistic utterance. In order to measure to what extent current vision-and-language models master this ability, we devise a new multimodal challenge, Image Retrieval from Contextual Descriptions (ImageCoDe). In particular, models are tasked with retrieving the correct image from a set of 10 minimally contrastive candidates based on a contextual description. As such, each description contains only the details that help distinguish between images. Because of this, descriptions tend to be complex in terms of syntax and discourse and require drawing pragmatic inferences. Images are sourced from both static pictures and video frames. We benchmark several state-of-the-art models, including both cross-encoders such as ViLBERT and bi-encoders such as CLIP, on ImageCoDe. Our results reveal that these models dramatically lag behind human performance: the best variant achi
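The retrieval setup described in the abstract can be illustrated with a minimal sketch of how a bi-encoder-style scorer might pick one image from a candidate set and how accuracy could be computed. This is an assumption-laden illustration, not the authors' implementation: the function names (`cosine`, `retrieve`, `retrieval_accuracy`) are hypothetical, the embeddings are hand-written toy vectors rather than CLIP outputs, and the toy set uses 3 candidates instead of the 10 used in the benchmark.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve(text_emb, image_embs):
    """Return the index of the candidate image most similar to the description."""
    scores = [cosine(text_emb, img) for img in image_embs]
    return max(range(len(scores)), key=scores.__getitem__)


def retrieval_accuracy(examples):
    """Fraction of examples where the top-scoring candidate is the target.

    Each example is (text_embedding, candidate_image_embeddings, target_index).
    """
    correct = sum(retrieve(text, imgs) == target for text, imgs, target in examples)
    return correct / len(examples)


# Toy data (hypothetical embeddings): three candidates, the third close to both others,
# mimicking a minimally contrastive distractor.
candidates = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
examples = [
    ([0.9, 0.1], candidates, 0),
    ([0.1, 0.9], candidates, 1),
    ([0.6, 0.8], candidates, 0),  # scorer prefers the distractor at index 2
]

print(retrieval_accuracy(examples))
```

In this toy run the third description is pulled toward the near-duplicate candidate, which is the kind of error a set of minimally contrastive images is designed to expose.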
Related papers
- Contextblip: Doubly Contextual Alignment For Contrastive Image Retrieval From Linguistically Complex Descriptions (2024)
- Deepimagesearch: Benchmarking Multimodal Agents For Context-aware Image Retrieval In Visual Histories (2026)
- Composed Video Retrieval Via Enriched Context And Discriminative Embeddings (2024)
- Data Roaming And Quality Assessment For Composed Image Retrieval (2023)
- Cir-cot: Towards Interpretable Composed Image Retrieval Via End-to-end Chain-of-thought Reasoning (2025)
- Imageref-vl: Enabling Contextual Image Referencing In Vision-language Models (2025)
- Calibclip: Contextual Calibration Of Dominant Semantics For Text-driven Image Retrieval (2025)
- Enhancing Image Retrieval: A Comprehensive Study On Photo Search Using The CLIP Mode (2024)