Scale Up Composed Image Retrieval Learning Via Modification Text Generation
2025 · Yinan Zhou, Yaxiong Wang, Haokun Lin, et al.
Abstract
Composed Image Retrieval (CIR) aims to retrieve an image of interest using a query that combines a reference image with modification text. Despite recent advances, the task remains challenging because training data is limited and triplet annotation is laborious. To address this issue, this paper proposes synthesizing training triplets to augment the training data for CIR. Specifically, we first train a modification text generator that exploits large-scale multimodal models, and then scale up CIR learning in both the pretraining and fine-tuning stages. During pretraining, we use the trained generator to directly create Modification Text-oriented Synthetic Triplets (MTST) conditioned on pairs of images. For fine-tuning, we first synthesize reverse modification text that connects the target image back to the reference image. Subsequently, we devise a two-hop alignment strategy to incrementally close the semantic gap between the multimodal pai
Related papers
- Triplet Synthesis For Enhancing Composed Image Retrieval Via Counterfactual Image Generation (2025)
- Improving Composed Image Retrieval Via Contrastive Learning With Scaling Positives And Negatives (2024)
- Automatic Synthesis Of High-quality Triplet Data For Composed Image Retrieval (2025)
- Scaling Prompt Instructed Zero Shot Composed Image Retrieval With Image-only Data (2025)
- Pseudo-triplet Guided Few-shot Composed Image Retrieval (2024)
- ConText-CIR: Learning From Concepts In Text For Composed Image Retrieval (2025)
- From Mapping To Composing: A Two-stage Framework For Zero-shot Composed Image Retrieval (2025)
- Good4cir: Generating Detailed Synthetic Captions For Composed Image Retrieval (2025)