Zero-shot Composed Image Retrieval With Textual Inversion
2023 · Alberto Baldrati, Lorenzo Agnolucci, Marco Bertini, et al.
Abstract
Composed Image Retrieval (CIR) aims to retrieve a target image based on a query composed of a reference image and a relative caption that describes the difference between the two images. The high effort and cost required for labeling datasets for CIR hamper the widespread usage of existing methods, as they rely on supervised learning. In this work, we propose a new task, Zero-Shot CIR (ZS-CIR), that aims to address CIR without requiring a labeled training dataset. Our approach, named zero-Shot composEd imAge Retrieval with textuaL invErsion (SEARLE), maps the visual features of the reference image into a pseudo-word token in CLIP token embedding space and integrates it with the relative caption. To support research on ZS-CIR, we introduce an open-domain benchmarking dataset named Composed Image Retrieval on Common Objects in context (CIRCO), which is the first dataset for CIR containing multiple ground truths for each query. The experiments show that SEARLE exhibits better performance
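The pseudo-word idea in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the random linear map stands in for whatever learned mapping SEARLE uses, the `$` placeholder and the "a photo of $ that ..." template are assumptions made for illustration, and the tiny dimensions and random vocabulary replace a real CLIP encoder and token embedding table.

```python
import random

EMBED_DIM = 8      # toy token-embedding size, chosen for illustration
IMAGE_DIM = 6      # toy image-feature size, chosen for illustration
PLACEHOLDER = "$"  # hypothetical marker for the pseudo-word position


def make_linear_map(rng, in_dim, out_dim):
    """Random weights standing in for a learned image-to-token mapping."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]


def image_to_pseudo_token(image_features, weights):
    """Project image features into the token embedding space."""
    return [sum(w * x for w, x in zip(row, image_features)) for row in weights]


def compose_query(relative_caption):
    """Build a token list that mixes the placeholder with the relative caption."""
    return ["a", "photo", "of", PLACEHOLDER, "that"] + relative_caption.split()


def embed_tokens(tokens, vocab, pseudo_token):
    """Look up each token embedding, swapping in the pseudo-word at the placeholder."""
    return [pseudo_token if t == PLACEHOLDER else vocab[t] for t in tokens]


rng = random.Random(0)
weights = make_linear_map(rng, IMAGE_DIM, EMBED_DIM)
# Stand-in for the output of an image encoder on the reference image.
image_features = [rng.uniform(-1.0, 1.0) for _ in range(IMAGE_DIM)]

tokens = compose_query("has a red roof")
# Toy vocabulary: one random embedding per ordinary word in the query.
vocab = {t: [rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
         for t in tokens if t != PLACEHOLDER}

pseudo_token = image_to_pseudo_token(image_features, weights)
sequence = embed_tokens(tokens, vocab, pseudo_token)
print(len(sequence), len(sequence[0]))
```

In a real pipeline the resulting embedding sequence would be passed through the text encoder, and the output would be compared against candidate image features. The sketch stops before that step because it depends on the specific CLIP implementation.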
Related papers
- iSEARLE: Improving Textual Inversion For Zero-shot Composed Image Retrieval (2024) · 12.09
- From Mapping To Composing: A Two-stage Framework For Zero-shot Composed Image Retrieval (2025) · 0.00
- SETR: A Two-stage Semantic-enhanced Framework For Zero-shot Composed Image Retrieval (2025) · 0.00
- Fine-grained Zero-shot Composed Image Retrieval With Complementary Visual-semantic Integration (2026) · 1.24
- HyCIR: Boosting Zero-shot Composed Image Retrieval With Synthetic Labels (2024) · 0.00
- Image2sentence Based Asymmetrical Zero-shot Composed Image Retrieval (2024) · 0.00
- Multimodal Reasoning Agent For Zero-shot Composed Image Retrieval (2025) · 0.00
- Reason-before-retrieve: One-stage Reflective Chain-of-thoughts For Training-free Zero-shot Composed Image Retrieval (2024) · 10.03