Compositional Image Retrieval Via Instruction-aware Contrastive Learning
2024 · Wenliang Zhong, Weizhi An, Feng Jiang, et al.
Abstract
Composed Image Retrieval (CIR) involves retrieving a target image based on a composed query: a reference image paired with text that specifies modifications to that image. CIR is inherently an instruction-following task, as the model needs to interpret the modifications and apply them to the image. In practice, because annotated data for downstream tasks is scarce, Zero-Shot CIR (ZS-CIR) is desirable. While existing ZS-CIR models based on CLIP have shown promising results, their ability to interpret and follow modification instructions remains limited. Some research attempts to address this by incorporating Large Language Models (LLMs). However, these approaches still face challenges in effectively integrating multimodal information and instruction understanding. To tackle the above challenges, we propose a novel embedding method utilizing an instruction-tuned Multimodal LLM (MLLM) to generate a composed representation, which significantly enhances the instruction following
Related papers
- From Mapping To Composing: A Two-stage Framework For Zero-shot Composed Image Retrieval (2025) · 0.00
- Scaling Prompt Instructed Zero Shot Composed Image Retrieval With Image-only Data (2025) · 0.00
- Multimodal Reasoning Agent For Zero-shot Composed Image Retrieval (2025) · 0.00
- Image2sentence Based Asymmetrical Zero-shot Composed Image Retrieval (2024) · 0.00
- Leveraging Large Vision-language Model As User Intent-aware Encoder For Composed Image Retrieval (2024) · 3.58
- Instance-level Composed Image Retrieval (2025) · 0.00
- Context-cir: Learning From Concepts In Text For Composed Image Retrieval (2025) · 4.67
- Mcot-re: Multi-faceted Chain-of-thought And Re-ranking For Training-free Zero-shot Composed Image Retrieval (2025) · 0.00