ImageScope: Unifying Language-guided Image Retrieval Via Large Multimodal Model Collective Reasoning
2025 · Pengfei Luo, Jingbo Zhou, Tong Xu, et al.
Abstract
With the proliferation of images in online content, language-guided image retrieval (LGIR) has emerged as a research hotspot over the past decade, encompassing a variety of subtasks with diverse input forms. While the development of large multimodal models (LMMs) has significantly facilitated these tasks, existing approaches often address them in isolation, requiring the construction of separate systems for each task. This not only increases system complexity and maintenance costs, but also exacerbates challenges stemming from language ambiguity and complex image content, making it difficult for retrieval systems to provide accurate and reliable results. To this end, we propose ImageScope, a training-free, three-stage framework that leverages collective reasoning to unify LGIR tasks. The key insight behind the unification lies in the compositional nature of language, which transforms diverse LGIR tasks into a generalized text-to-image retrieval process, along with the reasoning of LMMs
Related papers
- Indexing Multimodal Language Models For Large-scale Image Retrieval (2026) 0.00
- Cir-cot: Towards Interpretable Composed Image Retrieval Via End-to-end Chain-of-thought Reasoning (2025) 0.00
- Generative Cross-modal Retrieval: Memorizing Images In Multimodal Language Models For Retrieval And Beyond (2024) 8.35
- V-retrver: Evidence-driven Agentic Reasoning For Universal Multimodal Retrieval (2026) 0.00
- Mcot-re: Multi-faceted Chain-of-thought And Re-ranking For Training-free Zero-shot Composed Image Retrieval (2025) 0.00
- Leveraging Large Vision-language Model As User Intent-aware Encoder For Composed Image Retrieval (2024) 3.58
- Lamra: Large Multimodal Model As Your Advanced Retrieval Assistant (2024) 7.50
- Cotmr: Chain-of-thought Multi-scale Reasoning For Training-free Zero-shot Composed Image Retrieval (2025) 0.00