An Efficient Post-hoc Framework For Reducing Task Discrepancy Of Text Encoders For Composed Image Retrieval
2024 · Jaeseok Byun, Seokhyeon Jeong, Wonjae Kim, et al.
Abstract
Composed Image Retrieval (CIR) aims to retrieve a target image based on a reference image and conditioning text, enabling controllable image searches. The mainstream Zero-Shot (ZS) CIR methods bypass the need for expensive CIR training triplets by projecting image embeddings into the text token embedding space, forming a composed query for retrieval. However, we highlight an inherent limitation of these projection-based CIR methods: a task discrepancy between the original pre-training task of the text encoders (text \(\leftrightarrow\) image) and the target CIR task (image + text \(\leftrightarrow\) image), which can degrade CIR performance. To reduce this discrepancy, a naive solution would be to train both the image and text encoders on CIR triplets in a supervised manner. Instead, we introduce Reducing Task Discrepancy of Text Encoders (RTD), an efficient text-only post-hoc framework that complements projection-based CIR methods. We devise a novel target-ancho
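The projection-based query construction described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation, and it does not cover RTD itself. The random matrices `W_proj` and `W_text`, the mean-pooling `toy_text_encoder`, and all dimensions are hypothetical stand-ins for a learned projection module and a pre-trained text encoder. The sketch only shows the data flow: a reference image embedding is mapped to a pseudo-token, combined with the conditioning text tokens, encoded, and compared against gallery image embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IMG, D_TOK = 8, 6  # hypothetical embedding sizes, chosen only for illustration

# Hypothetical stand-in for a learned projection module (image embedding -> pseudo-token).
W_proj = rng.normal(size=(D_IMG, D_TOK))
# Hypothetical stand-in for the text encoder's mapping into the joint image-text space.
W_text = rng.normal(size=(D_TOK, D_IMG))


def l2_normalize(x):
    """Scale vectors to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def project_to_token(image_emb):
    """Map one image embedding to a single pseudo-token embedding."""
    return image_emb @ W_proj


def toy_text_encoder(token_embs):
    """Toy encoder (assumption): mean-pool the tokens, then apply a linear map."""
    return l2_normalize(token_embs.mean(axis=0) @ W_text)


def composed_query(ref_image_emb, text_token_embs):
    """Prepend the image pseudo-token to the text tokens and encode the sequence."""
    pseudo = project_to_token(ref_image_emb)[None, :]
    tokens = np.concatenate([pseudo, text_token_embs], axis=0)
    return toy_text_encoder(tokens)


def retrieve(query, gallery_embs, k=3):
    """Return indices of the k gallery images most similar to the query (cosine)."""
    sims = l2_normalize(gallery_embs) @ query
    return np.argsort(-sims)[:k]


ref = rng.normal(size=D_IMG)              # reference image embedding
text_tokens = rng.normal(size=(4, D_TOK))  # conditioning text token embeddings
gallery = rng.normal(size=(10, D_IMG))     # candidate target image embeddings

q = composed_query(ref, text_tokens)
top = retrieve(q, gallery)
```

In this sketch the text encoder receives a mixed sequence (image pseudo-token plus text tokens), which differs from the text-only input it would have seen during pre-training. That mismatch is the kind of task discrepancy the abstract refers to.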
Related papers
- From Mapping To Composing: A Two-stage Framework For Zero-shot Composed Image Retrieval (2025)
- Modality And Task Adaptation For Enhanced Zero-shot Composed Image Retrieval (2024)
- Scaling Prompt Instructed Zero Shot Composed Image Retrieval With Image-only Data (2025)
- Pretrain Like Your Inference: Masked Tuning Improves Zero-shot Composed Image Retrieval (2023)
- Multimodal Reasoning Agent For Zero-shot Composed Image Retrieval (2025)
- Context-cir: Learning From Concepts In Text For Composed Image Retrieval (2025)
- SDR-CIR: Semantic Debias Retrieval Framework For Training-free Zero-shot Composed Image Retrieval (2026)
- Zero-shot Composed Text-image Retrieval (2023)