Scaling Prompt Instructed Zero Shot Composed Image Retrieval With Image-only Data
2025 · Yiqun Duan, Sameera Ramasinghe, Stephen Gould, et al.
Abstract
Composed Image Retrieval (CIR) is the task of retrieving images that match a reference image augmented with text, where the text describes changes to the reference image in natural language. Traditionally, models designed for CIR have relied on triplet data containing a reference image, reformulation text, and a target image. However, curating such triplet data often requires human intervention, leading to prohibitive costs. This challenge has hindered the scalability of CIR model training even though unlabeled data is abundant. With recent advances in foundation models, we advocate a shift in the CIR training paradigm in which human annotations can be efficiently replaced by large language models (LLMs). Specifically, we demonstrate that large captioning and language models can efficiently generate data for CIR relying only on unannotated image collections. Additionally, we introduce an embedding reformulation architecture that effectively combines
Related papers
- Context-cir: Learning From Concepts In Text For Composed Image Retrieval (2025) 4.67
- From Mapping To Composing: A Two-stage Framework For Zero-shot Composed Image Retrieval (2025) 0.00
- Compositional Image Retrieval Via Instruction-aware Contrastive Learning (2024) 0.00
- Scale Up Composed Image Retrieval Learning Via Modification Text Generation (2025) 3.58
- Zero-shot Composed Text-image Retrieval (2023) 0.00
- Improving Composed Image Retrieval Via Contrastive Learning With Scaling Positives And Negatives (2024) 11.30
- Sentence-level Prompts Benefit Composed Image Retrieval (2023) 3.95
- Good4cir: Generating Detailed Synthetic Captions For Composed Image Retrieval (2025) 0.00