G-MIXER: Geodesic Mixup-based Implicit Semantic Expansion And Explicit Semantic Re-ranking For Zero-shot Composed Image Retrieval

Abstract

Composed Image Retrieval (CIR) aims to retrieve target images by integrating a reference image with a corresponding modification text. CIR requires jointly considering the explicit semantics specified in the query and the implicit semantics embedded within its bi-modal composition. Recent training-free Zero-Shot CIR (ZS-CIR) methods leverage Multimodal Large Language Models (MLLMs) to generate detailed target descriptions, converting the implicit information into explicit textual expressions. However, these methods rely heavily on the textual modality and fail to capture the inherently fuzzy nature of retrieval, which requires considering diverse combinations of candidates. This reduces both the diversity and the accuracy of retrieval results. To address this limitation, we propose a novel training-free method, Geodesic Mixup-based Implicit semantic eXpansion and Explicit semantic Re-ranking for ZS-CIR (G-MIXER). G-MIXER constructs composed query features that reflect the implicit semantics of reference i
