From Mapping To Composing: A Two-stage Framework For Zero-shot Composed Image Retrieval
2025 · Yabing Wang, Zhuotao Tian, Qingpei Guo, et al.
Abstract
Composed Image Retrieval (CIR) is a challenging multimodal task that retrieves a target image based on a reference image and accompanying modification text. Due to the high cost of annotating CIR triplet datasets, zero-shot (ZS) CIR has gained traction as a promising alternative. Existing studies mainly focus on projection-based methods, which map an image to a single pseudo-word token. However, these methods face three critical challenges: (1) insufficient representation capacity of the pseudo-word token, (2) discrepancies between the training and inference phases, and (3) reliance on large-scale synthetic data. To address these issues, we propose a two-stage framework in which training proceeds from mapping to composing. In the first stage, we enhance image-to-pseudo-word token learning by introducing a visual semantic injection module and a soft text alignment objective, enabling the token to capture richer and more fine-grained image information. In the second stage, we optimize the text e
Related papers
- Multimodal Reasoning Agent For Zero-shot Composed Image Retrieval (2025) 0.00
- Modality And Task Adaptation For Enhanced Zero-shot Composed Image Retrieval (2024) 0.00
- Fine-grained Zero-shot Composed Image Retrieval With Complementary Visual-semantic Integration (2026) 1.24
- Reason-before-retrieve: One-stage Reflective Chain-of-thoughts For Training-free Zero-shot Composed Image Retrieval (2024) 10.03
- Zero-shot Composed Text-image Retrieval (2023) 0.00
- Training-free Zero-shot Composed Image Retrieval With Local Concept Reranking (2023) 0.00
- iSEARLE: Improving Textual Inversion For Zero-shot Composed Image Retrieval (2024) 12.09
- CoTMR: Chain-of-thought Multi-scale Reasoning For Training-free Zero-shot Composed Image Retrieval (2025) 0.00