An Efficient Post-hoc Framework For Reducing Task Discrepancy Of Text Encoders For Composed Image Retrieval

Abstract

Composed Image Retrieval (CIR) aims to retrieve a target image based on a reference image and conditioning text, enabling controllable image search. Mainstream Zero-Shot (ZS) CIR methods bypass the need for expensive CIR training triplets by projecting image embeddings into the text token embedding space, forming a composed query for retrieval. However, we highlight an inherent limitation of these projection-based CIR methods: a task discrepancy of text encoders between their original pre-training task (text \(\leftrightarrow\) image) and the target CIR task (image + text \(\leftrightarrow\) image), which may negatively impact CIR performance. A naive way to reduce this discrepancy would be to train both the image and text encoders on CIR triplets in a supervised manner. Instead, we introduce Reducing Task Discrepancy of Text Encoders (RTD), an efficient text-only post-hoc framework that complements projection-based CIR methods. We devise a novel target-ancho
