Modality And Task Adaptation For Enhanced Zero-shot Composed Image Retrieval
2024 · Haiwen Li, Fei Su, Zhicheng Zhao
Abstract
As a challenging vision-language task, Zero-Shot Composed Image Retrieval (ZS-CIR) is designed to retrieve target images using bi-modal (image+text) queries. Typical ZS-CIR methods employ an inversion network to generate pseudo-word tokens that effectively represent the input semantics. However, inversion-based methods suffer from two inherent issues. First, a task discrepancy exists because inversion training and CIR inference involve different objectives. Second, a modality discrepancy arises from the mismatch in input feature distributions between training and inference. To address these issues, we propose a lightweight post-hoc framework consisting of two components: (1) A new text-anchored triplet construction pipeline that leverages a large language model (LLM) to transform a standard image-text dataset into a triplet dataset, where a textual description serves as the target of each triplet. (2) The MoTa-Adapter, a novel parameter-efficient fine-tuning method, adapts the dual encoder to the
Related papers
- From Mapping To Composing: A Two-stage Framework For Zero-shot Composed Image Retrieval (2025)
- Multimodal Reasoning Agent For Zero-shot Composed Image Retrieval (2025)
- Image2sentence Based Asymmetrical Zero-shot Composed Image Retrieval (2024)
- Fine-grained Zero-shot Composed Image Retrieval With Complementary Visual-semantic Integration (2026)
- An Efficient Post-hoc Framework For Reducing Task Discrepancy Of Text Encoders For Composed Image Retrieval (2024)
- Training-free Zero-shot Composed Image Retrieval Via Weighted Modality Fusion And Similarity (2024)
- Data-efficient Generalization For Zero-shot Composed Image Retrieval (2025)
- Pretrain Like Your Inference: Masked Tuning Improves Zero-shot Composed Image Retrieval (2023)