Leveraging Large Vision-language Model As User Intent-aware Encoder For Composed Image Retrieval
2024 · Zelong Sun, Dong Jing, Guoxing Yang, et al.
Abstract
Composed Image Retrieval (CIR) aims to retrieve target images from a candidate set using a hybrid-modality query consisting of a reference image and a relative caption that describes the user intent. Recent studies attempt to utilize Vision-Language Pre-training Models (VLPMs) with various fusion strategies to address the task. However, these methods typically fail to simultaneously meet two key requirements of CIR: comprehensively extracting visual information and faithfully following the user intent. In this work, we propose CIR-LVLM, a novel framework that leverages a large vision-language model (LVLM) as a powerful user intent-aware encoder to better meet these requirements. Our motivation is to explore the advanced reasoning and instruction-following capabilities of the LVLM for accurately understanding and responding to the user intent. Furthermore, we design a novel hybrid intent instruction module to provide explicit intent guidance at two levels: (1) The task prompt clarifies th
Related papers
- Compositional Image Retrieval Via Instruction-aware Contrastive Learning (2024) 0.00
- Scaling Prompt Instructed Zero Shot Composed Image Retrieval With Image-only Data (2025) 0.00
- Visual Delta Generator With Large Multi-modal Models For Semi-supervised Composed Image Retrieval (2024) 9.03
- Sentence-level Prompts Benefit Composed Image Retrieval (2023) 3.95
- Recall: Recalibrating Capability Degradation For MLLM-based Composed Image Retrieval (2026) 2.90
- CIR-CoT: Towards Interpretable Composed Image Retrieval Via End-to-end Chain-of-thought Reasoning (2025) 0.00
- Image Retrieval On Real-life Images With Pre-trained Vision-and-language Models (2021) 17.07
- MCoT-MVS: Multi-level Vision Selection By Multi-modal Chain-of-thought Reasoning For Composed Image Retrieval (2026) 0.00