Cross-modal Retrieval Meets Inference: Improving Zero-shot Classification With Cross-modal Retrieval
2023 · Seongha Eom, Namgyu Ho, Jaehoon Oh, et al.
Abstract
Contrastive language-image pre-training (CLIP) has demonstrated remarkable zero-shot classification ability, namely image classification using novel text labels. Existing works have attempted to enhance CLIP by fine-tuning on downstream tasks, but these have inadvertently led to performance degradation on unseen classes, thus harming zero-shot generalization. This paper aims to address this challenge by leveraging readily available image-text pairs from an external dataset for cross-modal guidance during inference. To this end, we propose X-MoRe, a novel inference method comprising two key steps: (1) cross-modal retrieval and (2) modal-confidence-based ensemble. Given a query image, we harness the power of CLIP's cross-modal representations to retrieve relevant textual information from an external image-text pair dataset. Then, we assign higher weights to the more reliable modality between the original query image and retrieved text, contributing to the final prediction. X-MoRe demonst
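The two steps named in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the embeddings are plain NumPy arrays standing in for CLIP features, captions are ranked directly by cosine similarity to the query image, the retrieved captions are averaged into one text feature, and each modality's confidence is taken to be its maximum class probability. The function names (`retrieve_captions`, `xmore_style_predict`), the temperature, and the weighting rule are all assumptions made for this example.

```python
import numpy as np


def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


def softmax(logits, temperature=0.01):
    """Numerically stable softmax over a 1-D array of similarity scores."""
    z = logits / temperature
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def retrieve_captions(query_img, caption_embs, k=4):
    """Step 1 (assumed form): indices of the k captions most similar to the image."""
    sims = caption_embs @ query_img
    return np.argsort(-sims)[:k]


def xmore_style_predict(query_img, label_embs, caption_embs, k=4, temperature=0.01):
    """Return (class probabilities, image-modality weight, retrieved indices)."""
    query_img = l2_normalize(query_img)
    label_embs = l2_normalize(label_embs)
    caption_embs = l2_normalize(caption_embs)

    # Step 1: cross-modal retrieval from the external image-text pair dataset.
    top = retrieve_captions(query_img, caption_embs, k)
    retrieved = l2_normalize(caption_embs[top].mean(axis=0))

    # Per-modality zero-shot predictions against the class-label text embeddings.
    p_img = softmax(label_embs @ query_img, temperature)
    p_txt = softmax(label_embs @ retrieved, temperature)

    # Step 2 (assumed form): weight each modality by its max class probability.
    conf_img, conf_txt = p_img.max(), p_txt.max()
    w_img = conf_img / (conf_img + conf_txt)
    probs = w_img * p_img + (1.0 - w_img) * p_txt
    return probs, w_img, top


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim, n_classes, n_captions = 16, 5, 50
    probs, w_img, top = xmore_style_predict(
        rng.normal(size=dim),                # stand-in for a CLIP image embedding
        rng.normal(size=(n_classes, dim)),   # stand-in for class-label text embeddings
        rng.normal(size=(n_captions, dim)),  # stand-in for external caption embeddings
    )
    print("predicted class:", int(np.argmax(probs)), "image weight:", float(w_img))
```

In practice the stand-in arrays would be replaced by features from a CLIP image and text encoder. The weighting rule is only one plausible reading of "modal-confidence-based ensemble"; an entropy-based confidence would slot into the same place.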
Related papers
- CIBR: Cross-modal Information Bottleneck Regularization For Robust CLIP Generalization (2025) · 4.52
- Distill CLIP (DCLIP): Enhancing Image-text Retrieval Via Cross-modal Transformer Distillation (2025) · 0.00
- Enhancing Image Retrieval: A Comprehensive Study On Photo Search Using The CLIP Mode (2024) · 0.00
- Cross The Gap: Exposing The Intra-modal Misalignment In CLIP Via Modality Inversion (2025) · 3.64
- Towards Zero-shot Cross-lingual Image Retrieval (2020) · 2.46
- Towards Zero-shot Cross-lingual Image Retrieval And Tagging (2021) · 2.46
- MobileCLIP: Fast Image-text Models Through Multi-modal Reinforced Training (2023) · 18.12
- Optimizing CLIP Models For Image Retrieval With Maintained Joint-embedding Alignment (2024) · 6.34