GenIR: Generative Visual Feedback For Mental Image Retrieval
2025 · Diji Yang, Minghao Liu, Chung-Hsiang Lo, et al.
Abstract
Vision-language models (VLMs) have shown strong performance on text-to-image retrieval benchmarks. However, bridging this success to real-world applications remains a challenge. In practice, human search behavior is rarely a one-shot action. Instead, it is often a multi-round process guided by clues in mind, that is, a mental image ranging from vague recollections to vivid mental representations of the target image. Motivated by this gap, we study the task of Mental Image Retrieval (MIR), which targets the realistic yet underexplored setting where users refine their search for a mentally envisioned image through multi-round interactions with an image search engine. Central to successful interactive retrieval is the capability of machines to provide users with clear, actionable feedback; however, existing methods rely on indirect or abstract verbal feedback, which can be ambiguous, misleading, or ineffective for users to refine the query. To overcome this, we propose GenIR, a generative
Related papers
- A Little More Like This: Text-to-image Retrieval With Vision-language Models Using Relevance Feedback (2025) 0.00
- IRGen: Generative Modeling For Image Retrieval (2023) 7.16
- ImageRef-VL: Enabling Contextual Image Referencing In Vision-language Models (2025) 1.91
- Visual-RAG: Benchmarking Text-to-image Retrieval Augmented Generation For Visual Knowledge Intensive Queries (2025) 0.00
- MRAG-Bench: Vision-centric Evaluation For Retrieval-augmented Multimodal Models (2024) 0.00
- Generative Cross-modal Retrieval: Memorizing Images In Multimodal Language Models For Retrieval And Beyond (2024) 8.35
- MIRACL-VISION: A Large, Multilingual, Visual Document Retrieval Benchmark (2025) 0.00
- Visual Model Checking: Graph-based Inference Of Visual Routines For Image Retrieval (2026) 0.00