DeepImageSearch: Benchmarking Multimodal Agents for Context-Aware Image Retrieval in Visual Histories
2026 · Chenlong Deng, Mengjie Deng, Junjie Wu, et al.
Abstract
Existing multimodal retrieval systems excel at semantic matching but implicitly assume that query-image relevance can be measured in isolation. This paradigm overlooks the rich dependencies inherent in realistic visual streams, where information is distributed across temporal sequences rather than confined to single snapshots. To bridge this gap, we introduce DeepImageSearch, a novel agentic paradigm that reformulates image retrieval as an autonomous exploration task. Models must plan and perform multi-step reasoning over raw visual histories to locate targets based on implicit contextual cues. We construct DISBench, a challenging benchmark built on interconnected visual data. To address the scalability challenge of creating context-dependent queries, we propose a human-model collaborative pipeline that employs vision-language models to mine latent spatiotemporal associations, effectively offloading intensive context discovery before human verification. Furthermore, we build a robust b
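To make the contrast in the abstract concrete, here is a minimal sketch, not the authors' method or the DISBench implementation. It compares scoring each image in isolation with a two-step search that first finds a contextual anchor in the history and then looks for the target near it in time. All names (`Photo`, `isolated_match`, `context_aware_search`, `HISTORY`) are hypothetical, and text captions stand in for visual content; a real system would operate on images and let an agent decide which steps to take.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Photo:
    photo_id: str
    taken_at: datetime
    caption: str  # assumption: a caption stands in for the image content


def isolated_match(history, keyword):
    """Baseline: judge every photo on its own content only."""
    return [p.photo_id for p in history if keyword in p.caption]


def context_aware_search(history, anchor_keyword, target_keyword, window):
    """Two-step search: locate anchor photos, then keep only targets
    taken within `window` of at least one anchor."""
    anchors = [p for p in history if anchor_keyword in p.caption]
    hits = []
    for photo in history:
        if target_keyword not in photo.caption:
            continue
        if any(abs(photo.taken_at - a.taken_at) <= window for a in anchors):
            hits.append(photo.photo_id)
    return hits


# Toy visual history (hypothetical data).
HISTORY = [
    Photo("p1", datetime(2024, 5, 1, 9, 0), "dog in the park"),
    Photo("p2", datetime(2024, 5, 1, 12, 0), "birthday cake on a table"),
    Photo("p3", datetime(2024, 8, 12, 15, 0), "dog on the beach"),
]

# Query: "the dog photo from the day of the birthday party".
print(isolated_match(HISTORY, "dog"))
print(context_aware_search(HISTORY, "birthday", "dog", timedelta(hours=12)))
```

In this toy history, isolated matching cannot tell the two dog photos apart, while the context-aware search keeps only the one taken near the birthday photo. The fixed anchor-then-target plan is a simplification; the paradigm described above has the model choose its own exploration steps.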
Related papers
- IDMR: Towards Instance-driven Precise Visual Correspondence In Multimodal Retrieval (2025) · 2.29
- Vision-deepresearch Benchmark: Rethinking Visual And Textual Search For Multimodal Large Language Models (2026) · 7.27
- Image Retrieval From Contextual Descriptions (2022) · 8.09
- MM-BRIGHT: A Multi-task Multimodal Benchmark For Reasoning-intensive Retrieval (2026) · 2.60
- Visual Haystacks: A Vision-centric Needle-in-a-haystack Benchmark (2024) · 0.00
- Connecting Images Through Time And Sources: Introducing Low-data, Heterogeneous Instance Retrieval (2021) · 0.00
- Entity Image And Mixed-modal Image Retrieval Datasets (2025) · 1.56
- A Multimodal Deep Learning Framework For Scalable Content Based Visual Media Retrieval (2021) · 0.00