good4cir: Generating Detailed Synthetic Captions for Composed Image Retrieval
2025 · Pranavi Kolouju, Eric Xing, Robert Pless, et al.
Abstract
Composed image retrieval (CIR) enables users to search images using a reference image combined with textual modifications. Recent advances in vision-language models have improved CIR, but dataset limitations remain a barrier. Existing datasets often rely on simplistic, ambiguous, or insufficient manual annotations, hindering fine-grained retrieval. We introduce good4cir, a structured pipeline leveraging vision-language models to generate high-quality synthetic annotations. Our method involves: (1) extracting fine-grained object descriptions from query images, (2) generating comparable descriptions for target images, and (3) synthesizing textual instructions capturing meaningful transformations between images. This reduces hallucination, enhances modification diversity, and ensures object-level consistency. Applying our method improves existing datasets and enables creating new datasets across diverse domains. Results demonstrate improved retrieval accuracy for CIR models trained on our
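The three stages listed in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `VLM` callable interface, the prompt wording, the `Triplet` container, and the `fake_vlm` stand-in are all assumptions made here so that the control flow is visible and runnable without a real model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical interface: (image path or None, prompt) -> generated text.
# Any real vision-language model client would need to be wrapped to fit it.
VLM = Callable[[Optional[str], str], str]


@dataclass
class Triplet:
    query_image: str
    target_image: str
    modification_text: str


def describe_objects(vlm: VLM, image: str, prompt: str) -> List[str]:
    """Ask the model for one object description per line and parse the lines."""
    raw = vlm(image, prompt)
    return [ln.strip().lstrip("-").strip() for ln in raw.splitlines() if ln.strip()]


def build_triplet(vlm: VLM, query_image: str, target_image: str) -> Triplet:
    # Stage 1: fine-grained object descriptions for the query image.
    query_objs = describe_objects(
        vlm, query_image,
        "List the prominent objects in this image, one per line, in detail.",
    )
    # Stage 2: comparable descriptions for the target image, conditioned on
    # the query list so that shared objects are named consistently.
    target_prompt = (
        "Objects found in a related image:\n" + "\n".join(query_objs)
        + "\nDescribe this image's objects in the same style, one per line."
    )
    target_objs = describe_objects(vlm, target_image, target_prompt)
    # Stage 3: synthesize a text instruction from the two lists (text only).
    diff_prompt = (
        "Query objects:\n" + "\n".join(query_objs)
        + "\nTarget objects:\n" + "\n".join(target_objs)
        + "\nWrite one instruction that turns the query into the target."
    )
    instruction = vlm(None, diff_prompt).strip()
    return Triplet(query_image, target_image, instruction)


def fake_vlm(image: Optional[str], prompt: str) -> str:
    """Canned stand-in for a real model, used only to exercise the pipeline."""
    if image == "query.jpg":
        return "- red sofa\n- wooden table"
    if image == "target.jpg":
        return "- blue sofa\n- wooden table"
    return "Change the red sofa to a blue sofa."
```

Calling `build_triplet(fake_vlm, "query.jpg", "target.jpg")` returns a `Triplet` whose `modification_text` is the canned instruction. Conditioning the second stage on the first stage's output is one plausible way to get the object-level consistency the abstract mentions; the pipeline itself may do this differently.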
Related papers
- Scaling Prompt Instructed Zero Shot Composed Image Retrieval With Image-only Data (2025)
- Automatic Synthesis Of High-quality Triplet Data For Composed Image Retrieval (2025)
- Triplet Synthesis For Enhancing Composed Image Retrieval Via Counterfactual Image Generation (2025)
- Context-CIR: Learning From Concepts In Text For Composed Image Retrieval (2025)
- Scale Up Composed Image Retrieval Learning Via Modification Text Generation (2025)
- A Sanity Check On Composed Image Retrieval (2026)
- Improving Composed Image Retrieval Via Contrastive Learning With Scaling Positives And Negatives (2024)
- HyCIR: Boosting Zero-shot Composed Image Retrieval With Synthetic Labels (2024)