Pretrain Like Your Inference: Masked Tuning Improves Zero-shot Composed Image Retrieval
2023 · Junyang Chen, Hanjiang Lai
Abstract
Zero-shot composed image retrieval (ZS-CIR), which takes a textual modification and a reference image as a query to retrieve a target image without triplet labeling, has gained increasing attention in data mining. Current ZS-CIR research mainly relies on the generalization ability of pre-trained vision-language models, e.g., CLIP. However, there is a substantial discrepancy between pre-trained vision-language models and CIR tasks: the vision-language models focus on learning similarities, whereas CIR aims to learn image modifications guided by text. In this paper, we introduce a novel unlabeled and pre-trained masked tuning approach, which reduces the gap between the pre-trained vision-language model and the downstream CIR task. First, we reformulate the contrastive learning of the vision-language model as the CIR task: we randomly mask input image patches to generate a \(\langle\)masked image, text, image\(\rangle\) triplet from an image-text pair. Th
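The triplet construction mentioned in the abstract (randomly masking image patches to turn an image-text pair into a masked image, text, image triplet) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `mask_random_patches` and `make_triplet`, the patch size, the mask ratio, and the fill value of 0 are all assumptions chosen for illustration, and the image is a plain nested list rather than a tensor.

```python
import random

def mask_random_patches(image, patch=2, ratio=0.5, fill=0, seed=None):
    """Return a copy of `image` (a list of rows) with a random subset of
    non-overlapping patch x patch blocks overwritten by `fill`.

    `patch`, `ratio`, and `fill` are illustrative assumptions, not values
    taken from the paper.
    """
    rng = random.Random(seed)
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image height and width must be multiples of patch")
    # Top-left corner of every non-overlapping patch.
    corners = [(r, c) for r in range(0, h, patch) for c in range(0, w, patch)]
    n_masked = round(ratio * len(corners))
    masked = [row[:] for row in image]  # copy so the original stays intact
    for r, c in rng.sample(corners, n_masked):
        for dr in range(patch):
            for dc in range(patch):
                masked[r + dr][c + dc] = fill
    return masked

def make_triplet(image, text, **kwargs):
    """Build a (masked image, text, image) triplet from one image-text pair."""
    return mask_random_patches(image, **kwargs), text, image

# Toy 4x4 single-channel "image" whose pixels are all non-zero.
image = [[1 + r * 4 + c for c in range(4)] for r in range(4)]
masked, text, target = make_triplet(
    image, "a dog on a beach", patch=2, ratio=0.5, seed=0
)
```

In this sketch the masked image plays the role of the reference image, the caption plays the role of the textual modification, and the unmasked image is the retrieval target, so no manually labeled triplets are needed.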
Related papers
- Zero-shot Composed Image Retrieval Considering Query-target Relationship Leveraging Masked Image-text Pairs (2024)
- From Mapping To Composing: A Two-stage Framework For Zero-shot Composed Image Retrieval (2025)
- Modality And Task Adaptation For Enhanced Zero-shot Composed Image Retrieval (2024)
- HyCIR: Boosting Zero-shot Composed Image Retrieval With Synthetic Labels (2024)
- Scaling Prompt Instructed Zero Shot Composed Image Retrieval With Image-only Data (2025)
- Data-efficient Generalization For Zero-shot Composed Image Retrieval (2025)
- SCOT: Self-supervised Contrastive Pretraining For Zero-shot Compositional Retrieval (2025)
- An Efficient Post-hoc Framework For Reducing Task Discrepancy Of Text Encoders For Composed Image Retrieval (2024)