Caption-matching: A Multimodal Approach For Cross-domain Image Retrieval
2024 · Lucas Iijima, Nikolaos Giakoumoglou, Tania Stathaki
Abstract
Cross-Domain Image Retrieval (CDIR) is a challenging task in computer vision that aims to match images across different visual domains such as sketches, paintings, and photographs. Existing CDIR methods rely either on supervised learning with labeled cross-domain correspondences or on training or fine-tuning on target datasets, and they often struggle with substantial domain gaps and generalize poorly to unseen domains. This paper introduces a novel CDIR approach that incorporates textual context by leveraging publicly available pre-trained vision-language models. Our method, Caption-Matching (CM), uses generated image captions as a domain-agnostic intermediate representation, enabling effective cross-domain similarity computation without the need for labeled data or further training. We evaluate our method on standard CDIR benchmark datasets, demonstrating state-of-the-art performance in plug-and-play settings with consistent improvements on Office-Home and DomainNet.
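As a rough illustration of the caption-as-intermediate idea described in the abstract, the sketch below ranks gallery images by the similarity of their captions to a query caption. It is not the authors' implementation: the captions are assumed to have already been produced by some pre-trained captioning model, and the bag-of-words cosine similarity is a toy stand-in for whatever text encoder a real pipeline would use. All function names and file names here are hypothetical.

```python
import math
from collections import Counter


def embed_caption(caption: str) -> Counter:
    """Toy stand-in for a text encoder: lowercase bag-of-words counts.

    Assumption: a real system would use a pre-trained vision-language or
    text model here; this is only for illustration.
    """
    return Counter(caption.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_caption: str, gallery_captions: dict, top_k: int = 3) -> list:
    """Rank gallery images by caption similarity to the query caption."""
    q = embed_caption(query_caption)
    scored = [
        (image_id, cosine(q, embed_caption(caption)))
        for image_id, caption in gallery_captions.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical captions for images from different visual domains.
gallery = {
    "photo_bike.jpg": "a red bicycle leaning against a wall",
    "painting_cat.jpg": "a cat sleeping on a sofa",
    "photo_clock.jpg": "an alarm clock on a wooden desk",
}
results = retrieve("a sketch of a bicycle leaning against a wall", gallery, top_k=2)
```

Because the comparison happens entirely in caption space, the query sketch and the gallery photo never need to share visual features; only their descriptions have to overlap.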
Related papers
- Scaling Prompt Instructed Zero Shot Composed Image Retrieval With Image-only Data (2025)
- Good4cir: Generating Detailed Synthetic Captions For Composed Image Retrieval (2025)
- HINT: Composed Image Retrieval With Dual-path Compositional Contextualized Network (2026)
- Scene-centric Vs. Object-centric Image-text Cross-modal Retrieval: A Reproducibility Study (2023)
- IDMR: Towards Instance-driven Precise Visual Correspondence In Multimodal Retrieval (2025)
- Visual Delta Generator With Large Multi-modal Models For Semi-supervised Composed Image Retrieval (2024)
- DAFM: Dynamic Adaptive Fusion For Multi-model Collaboration In Composed Image Retrieval (2025)
- Pic2word: Mapping Pictures To Words For Zero-shot Composed Image Retrieval (2023)