Fine-grained Zero-shot Composed Image Retrieval With Complementary Visual-semantic Integration
2026 · Yongcong Ye, Kai Zhang, Yanghai Zhang, et al.
Abstract
Zero-shot composed image retrieval (ZS-CIR) is a rapidly growing area with significant practical applications, allowing users to retrieve a target image by providing a reference image and a relative caption describing the desired modifications. Existing ZS-CIR methods often struggle to capture fine-grained changes and integrate visual and semantic information effectively. They primarily rely on either transforming the multimodal query into a single text using image-to-text models or employing large language models for target image description generation, approaches that often fail to capture complementary visual information and complete semantic context. To address these limitations, we propose a novel Fine-Grained Zero-Shot Composed Image Retrieval method with Complementary Visual-Semantic Integration (CVSI). Specifically, CVSI leverages three key components: (1) Visual Information Extraction, which not only extracts global image features but also uses a pre-trained mapping network to
Related papers
- From Mapping To Composing: A Two-stage Framework For Zero-shot Composed Image Retrieval (2025) · 0.00
- SETR: A Two-stage Semantic-enhanced Framework For Zero-shot Composed Image Retrieval (2025) · 0.00
- WISER: Wider Search, Deeper Thinking, And Adaptive Fusion For Training-free Zero-shot Composed Image Retrieval (2026) · 2.98
- Data-efficient Generalization For Zero-shot Composed Image Retrieval (2025) · 2.26
- iSEARLE: Improving Textual Inversion For Zero-shot Composed Image Retrieval (2024) · 12.09
- Zero-shot Composed Image Retrieval With Textual Inversion (2023) · 19.84
- CoTMR: Chain-of-thought Multi-scale Reasoning For Training-free Zero-shot Composed Image Retrieval (2025) · 0.00
- Multimodal Reasoning Agent For Zero-shot Composed Image Retrieval (2025) · 0.00