Caption-matching: A Multimodal Approach For Cross-domain Image Retrieval

Lucas Iijima, Nikolaos Giakoumoglou, Tania Stathaki · 21st International Conference on Computer Vision Theory and Applications · 2024

Cross-Domain Image Retrieval (CDIR) is a challenging task in computer vision that aims to match images across different visual domains such as sketches, paintings, and photographs. Existing CDIR methods rely either on supervised learning with labeled cross-domain correspondences or on training or fine-tuning on target datasets, and they often struggle with substantial domain gaps and generalize poorly to unseen domains. This paper introduces a novel CDIR approach that incorporates textual context by leveraging publicly available pre-trained vision-language models. Our method, Caption-Matching (CM), uses generated image captions as a domain-agnostic intermediate representation, enabling effective cross-domain similarity computation without the need for labeled data or further training. We evaluate our method on standard CDIR benchmark datasets, demonstrating state-of-the-art performance in plug-and-play settings with consistent improvements on Office-Home and DomainNet over previous methods. We also demonstrate our method’s effectiveness on a dataset of AI-generated images from Midjourney, showcasing its ability to handle complex, multi-domain queries.
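The caption-as-intermediate-representation idea can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: the file names and captions are made up, the captions stand in for the output of a pre-trained vision-language captioner, and a simple bag-of-words cosine stands in for the learned text-similarity model a real system would use. The retrieval step itself — rank gallery images by the similarity of their captions to the query's caption — matches the structure the abstract describes.

```python
from collections import Counter
import math

def caption_similarity(a: str, b: str) -> float:
    """Cosine similarity over bag-of-words token counts.
    (A stand-in for a proper sentence-embedding similarity.)"""
    ta, tb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ta[w] * tb[w] for w in set(ta) & set(tb))
    na = math.sqrt(sum(v * v for v in ta.values()))
    nb = math.sqrt(sum(v * v for v in tb.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_caption: str, gallery_captions: dict, top_k: int = 3) -> list:
    """Rank gallery images by caption similarity to the query caption."""
    ranked = sorted(gallery_captions.items(),
                    key=lambda kv: caption_similarity(query_caption, kv[1]),
                    reverse=True)
    return [image for image, _ in ranked[:top_k]]

# Hypothetical captions, as a pre-trained captioner might produce them.
# Note the captions abstract away the visual domain (photo vs. sketch).
gallery = {
    "photo_001.jpg":  "a brown dog running on a sandy beach",
    "sketch_014.png": "pencil sketch of a dog running on a beach",
    "paint_007.jpg":  "oil painting of mountains at sunset",
}
query = "a sketch of a dog playing near the ocean"
print(retrieve(query, gallery, top_k=2))
# → ['sketch_014.png', 'photo_001.jpg']
```

Because both the sketch and the photo of a dog caption to similar text, they rank above the unrelated painting regardless of visual domain — the point of using captions as the matching space.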
