ImageRef-VL: Enabling Contextual Image Referencing In Vision-language Models
2025 · Jingwei Yi, Junhao Yin, Ju Xu, et al.
Abstract
Vision-Language Models (VLMs) have demonstrated remarkable capabilities in understanding multimodal inputs and have been widely integrated into Retrieval-Augmented Generation (RAG) based conversational systems. While current VLM-powered chatbots can provide textual source references in their responses, they exhibit significant limitations in referencing contextually relevant images during conversations. In this paper, we introduce Contextual Image Reference -- the ability to appropriately reference relevant images from retrieval documents based on conversation context -- and systematically investigate VLMs' capability in this aspect. We conduct the first evaluation for contextual image referencing, comprising a dedicated testing dataset and evaluation metrics. Furthermore, we propose ImageRef-VL, a method that significantly enhances open-source VLMs' image referencing capabilities through instruction fine-tuning on a large-scale, manually curated multimodal conversation dataset. Experi
Related papers
- A Little More Like This: Text-to-image Retrieval With Vision-language Models Using Relevance Feedback (2025)
- Toward Automatic Relevance Judgment Using Vision-language Models For Image-text Retrieval Evaluation (2024)
- GenIR: Generative Visual Feedback For Mental Image Retrieval (2025)
- Enhancing Interactive Image Retrieval With Query Rewriting Using Large Language Models And Vision Language Models (2024)
- Retrieving Counterfactuals Improves Visual In-context Learning (2026)
- Leveraging Large Vision-language Model As User Intent-aware Encoder For Composed Image Retrieval (2024)
- A Multimodal Recaptioning Framework To Account For Perceptual Diversity Across Languages In Vision-language Modeling (2025)
- MRAG-Bench: Vision-centric Evaluation For Retrieval-augmented Multimodal Models (2024)