Context-I2W: Mapping Images To Context-dependent Words For Accurate Zero-shot Composed Image Retrieval
2023 · Yuanmin Tang, Jing Yu, Keke Gai, et al.
Abstract
Unlike the Composed Image Retrieval task, which requires expensive labels for training task-specific models, Zero-Shot Composed Image Retrieval (ZS-CIR) involves diverse tasks with a broad range of visual content manipulation intents that may relate to domain, scene, object, and attribute. The key challenge for ZS-CIR tasks is to learn a more accurate image representation that attends adaptively to the reference image for various manipulation descriptions. In this paper, we propose a novel context-dependent mapping network, named Context-I2W, that adaptively converts description-relevant Image information into a pseudo-word token, which is composed with the description for accurate ZS-CIR. Specifically, an Intent View Selector first dynamically learns a rotation rule to map the identical image to a task-specific manipulation view. Then a Visual Target Extractor further captures local information covering the main targets in ZS-CIR tasks under the guidance of multiple learnable queries.
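The idea of extracting local information "under the guidance of multiple learnable queries" can be illustrated with a small cross-attention sketch. This is not the authors' implementation: the function names (`cross_attend`, `pseudo_word_token`), the single-head attention without projection matrices, the random (untrained) queries, and the mean pooling into one token are all assumptions made for illustration, and the Intent View Selector is omitted entirely.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cross_attend(queries, patches):
    """Each query attends over all patch features.

    Assumption: single head, scaled dot-product, no learned projections.
    Returns the attended outputs and the attention weights per query.
    """
    dim = len(patches[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for q in queries:
        w = softmax([dot(q, p) * scale for p in patches])
        out = [sum(w_i * p[d] for w_i, p in zip(w, patches)) for d in range(dim)]
        outputs.append(out)
        weights.append(w)
    return outputs, weights

def pseudo_word_token(queries, patches):
    """Average the query outputs into one vector.

    Assumption: mean pooling stands in for whatever mapping a real model
    would learn to produce the pseudo-word token.
    """
    outputs, _ = cross_attend(queries, patches)
    dim = len(outputs[0])
    return [sum(o[d] for o in outputs) / len(outputs) for d in range(dim)]

# Toy data: random vectors stand in for image patch features and for
# queries that would be learned during training.
random.seed(0)
DIM, NUM_PATCHES, NUM_QUERIES = 8, 5, 3
patches = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_PATCHES)]
queries = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)]

token = pseudo_word_token(queries, patches)
print(len(token))  # dimensionality of the resulting token
```

In a ZS-CIR pipeline of this kind, a token like this would be placed into the text prompt alongside the manipulation description before text encoding; that step is not shown here.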
Related papers
- From Mapping To Composing: A Two-stage Framework For Zero-shot Composed Image Retrieval (2025) 0.00
- Denoise-I2W: Mapping Images To Denoising Words For Accurate Zero-shot Composed Image Retrieval (2024) 1.40
- Missing Target-relevant Information Prediction With World Model For Accurate Zero-shot Composed Image Retrieval (2025) 6.81
- Pic2Word: Mapping Pictures To Words For Zero-shot Composed Image Retrieval (2023) 20.24
- Fine-grained Zero-shot Composed Image Retrieval With Complementary Visual-semantic Integration (2026) 1.24
- WISER: Wider Search, Deeper Thinking, And Adaptive Fusion For Training-free Zero-shot Composed Image Retrieval (2026) 2.98
- Modality And Task Adaptation For Enhanced Zero-shot Composed Image Retrieval (2024) 0.00
- HINT: Composed Image Retrieval With Dual-path Compositional Contextualized Network (2026) 0.78