WISER: Wider Search, Deeper Thinking, And Adaptive Fusion For Training-free Zero-shot Composed Image Retrieval
2026 · Tianyue Wang, Leigang Qu, Tianyu Yang, et al.
Abstract
Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve target images given a multimodal query (comprising a reference image and a modification text), without training on annotated triplets. Existing methods typically convert the multimodal query into a single modality, either as an edited caption for Text-to-Image retrieval (T2I) or as an edited image for Image-to-Image retrieval (I2I). However, each paradigm has inherent limitations: T2I often loses fine-grained visual details, while I2I struggles with complex semantic modifications. To effectively leverage their complementary strengths under diverse query intents, we propose WISER, a training-free framework that unifies T2I and I2I via a "retrieve-verify-refine" pipeline, explicitly modeling intent awareness and uncertainty awareness. Specifically, WISER first performs Wider Search by generating both edited captions and images for parallel retrieval to broaden the candidate pool. Then, it conducts Adaptive Fusion with a verifier
Related papers
- Fine-grained Zero-shot Composed Image Retrieval With Complementary Visual-semantic Integration (2026)
- Training-free Zero-shot Composed Image Retrieval Via Weighted Modality Fusion And Similarity (2024)
- SETR: A Two-stage Semantic-enhanced Framework For Zero-shot Composed Image Retrieval (2025)
- From Mapping To Composing: A Two-stage Framework For Zero-shot Composed Image Retrieval (2025)
- iSEARLE: Improving Textual Inversion For Zero-shot Composed Image Retrieval (2024)
- Multimodal Reasoning Agent For Zero-shot Composed Image Retrieval (2025)
- Generative Editing In The Joint Vision-language Space For Zero-shot Composed Image Retrieval (2025)
- Reason-before-retrieve: One-stage Reflective Chain-of-thoughts For Training-free Zero-shot Composed Image Retrieval (2024)