Seeing Through Words: Controlling Visual Retrieval Quality With Language Models
2026 · Jianglin Lu, Simon Jenni, Kushal Kafle, et al.
Abstract
Text-to-image retrieval is a fundamental task in vision-language learning, yet in real-world scenarios it is often challenged by short and underspecified user queries. Such queries are typically only one or two words long, rendering them semantically ambiguous, prone to collisions across diverse visual interpretations, and lacking explicit control over the quality of retrieved images. To address these issues, we propose a new paradigm of quality-controllable retrieval, which enriches short queries with contextual details while incorporating explicit notions of image quality. Our key idea is to leverage a generative language model as a query completion function, extending underspecified queries into descriptive forms that capture fine-grained visual attributes such as pose, scene, and aesthetics. We introduce a general framework that conditions query completion on discretized quality levels, derived from relevance and aesthetic scoring models, so that query enrichment is not only semant
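The abstract describes conditioning query completion on discretized quality levels derived from relevance and aesthetic scores. The following is a minimal sketch of that idea, not the authors' implementation: the quantile-based binning, the control-tag format, and the names `make_bin_edges`, `to_level`, and `build_completion_prompt` are all assumptions made for illustration.

```python
from bisect import bisect_right
from statistics import quantiles


def make_bin_edges(scores, num_levels=3):
    """Return num_levels - 1 cut points splitting scores into equal-mass bins.

    Assumption: quantile binning; the abstract does not say how levels are derived.
    """
    return quantiles(scores, n=num_levels)


def to_level(score, edges):
    """Map a raw score to a discrete level in 0..len(edges)."""
    return bisect_right(edges, score)


def build_completion_prompt(query, relevance_level, aesthetic_level):
    """Prefix a short query with hypothetical quality-control tags.

    The tag syntax is illustrative only; a language model would be asked
    to extend this prompt into a descriptive query.
    """
    return (
        f"<relevance={relevance_level}> <aesthetic={aesthetic_level}> "
        f"Complete this short query into a detailed image description: {query}"
    )


# Toy scores standing in for outputs of relevance and aesthetic scoring models.
aesthetic_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
relevance_scores = [0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]

aesthetic_edges = make_bin_edges(aesthetic_scores)
relevance_edges = make_bin_edges(relevance_scores)

prompt = build_completion_prompt(
    "dog",
    relevance_level=to_level(0.9, relevance_edges),
    aesthetic_level=to_level(0.5, aesthetic_edges),
)
print(prompt)
```

Under these assumptions, a user would pick the desired levels at query time, and the completed description would then be passed to an ordinary text-to-image retriever.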
Related papers
- Enhancing Interactive Image Retrieval With Query Rewriting Using Large Language Models And Vision Language Models (2024) 8.82
- A Little More Like This: Text-to-image Retrieval With Vision-language Models Using Relevance Feedback (2025) 0.00
- Aligning Vision Models With Human Aesthetics In Retrieval: Benchmarks And Algorithms (2024) 0.00
- Enhancing Image Quality Assessment Ability Of LMMs Via Retrieval-augmented Generation (2026) 0.00
- Cross-modal RAG: Sub-dimensional Text-to-image Retrieval-augmented Generation (2025) 0.00
- Learning The Visualness Of Text Using Large Vision-language Models (2023) 4.52
- Interactive Text-to-image Retrieval With Large Language Models: A Plug-and-play Approach (2024) 10.24
- Recqr: Incorporating Conversational Query Rewriting To Improve Multimodal Image Retrieval (2026) 0.00