A Little More Like This: Text-to-image Retrieval With Vision-language Models Using Relevance Feedback
2025 · Bulat Khaertdinov, Mirela Popa, Nava Tintarev
Abstract
Large vision-language models (VLMs) enable intuitive visual search using natural language queries. However, improving their performance often requires fine-tuning and scaling to larger model variants. In this work, we propose a mechanism inspired by traditional text-based search to improve retrieval performance at inference time: relevance feedback. While relevance feedback can serve as an alternative to fine-tuning, its model-agnostic design also enables use with fine-tuned VLMs. Specifically, we introduce and evaluate four feedback strategies for VLM-based retrieval. First, we revise classical pseudo-relevance feedback (PRF), which refines query embeddings based on top-ranked results. To address its limitations, we propose generative relevance feedback (GRF), which uses synthetic captions for query refinement. Furthermore, we introduce an attentive feedback summarizer (AFS), a custom transformer-based model that integrates multimodal fine-grained features from relevant items. Finally
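The pseudo-relevance feedback idea mentioned in the abstract (refining the query embedding using top-ranked results) can be illustrated with a minimal Rocchio-style sketch. This is not the authors' implementation: the function names, the weights `alpha` and `beta`, the cutoff `k`, and the random embeddings standing in for VLM text and image features are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def prf_refine(query_emb, image_embs, k=3, alpha=1.0, beta=0.5):
    """Rocchio-style pseudo-relevance feedback (illustrative, hypothetical helper).

    The top-k images under the initial query are assumed relevant, and the
    query is moved toward their centroid in the shared embedding space.
    """
    q = l2_normalize(query_emb)
    imgs = l2_normalize(image_embs)
    scores = imgs @ q                        # cosine similarity to each image
    top_k = np.argsort(-scores)[:k]          # indices of pseudo-relevant images
    centroid = imgs[top_k].mean(axis=0)      # feedback signal
    refined = l2_normalize(alpha * q + beta * centroid)
    return refined, top_k


# Toy data: random vectors stand in for real VLM embeddings (assumption).
rng = np.random.default_rng(0)
image_embs = rng.normal(size=(100, 16))
query_emb = image_embs[7] + 0.5 * rng.normal(size=16)

refined, top_k = prf_refine(query_emb, image_embs)
# Second retrieval pass with the refined query.
reranked = np.argsort(-(l2_normalize(image_embs) @ refined))
```

Because the top-ranked items are only assumed to be relevant, a poor initial ranking pulls the query in the wrong direction, which is one commonly cited limitation of this family of methods.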
Related papers
- Enhancing Interactive Image Retrieval With Query Rewriting Using Large Language Models And Vision Language Models (2024) 8.82
- Toward Automatic Relevance Judgment Using Vision-language Models For Image-text Retrieval Evaluation (2024) 0.00
- GenIR: Generative Visual Feedback For Mental Image Retrieval (2025) 0.00
- ImageRef-VL: Enabling Contextual Image Referencing In Vision-language Models (2025) 1.91
- DREAM: Improving Video-text Retrieval Through Relevance-based Augmentation Using Large Foundation Models (2024) 2.26
- Infusing Fine-grained Visual Knowledge To Vision-language Models (2025) 0.00
- Aligning Vision Models With Human Aesthetics In Retrieval: Benchmarks And Algorithms (2024) 0.00
- A Multimodal Recaptioning Framework To Account For Perceptual Diversity Across Languages In Vision-language Modeling (2025) 0.00