ELIP: Enhanced Visual-language Foundation Models For Image Retrieval
2025 · Guanqi Zhan, Yuanpei Liu, Kai Han, et al.
Abstract
The objective in this paper is to improve the performance of text-to-image retrieval. To this end, we introduce a new framework that can boost the performance of large-scale pre-trained vision-language models, so that they can be used for text-to-image re-ranking. The approach, Enhanced Language-Image Pre-training (ELIP), uses the text query, via a simple MLP mapping network, to predict a set of visual prompts to condition the ViT image encoding. ELIP can easily be applied to the commonly used CLIP, SigLIP and BLIP-2 networks. To train the architecture with limited computing resources, we develop a 'student friendly' best practice, involving global hard sample mining, and curation of a large-scale dataset. On the evaluation side, we set up two new out-of-distribution (OOD) benchmarks, Occluded COCO and ImageNet-R, to assess the zero-shot generalisation of the models to different domains. The results demonstrate that ELIP significantly boosts CLIP/SigLIP/SigLIP-2 text-to-image retrieval
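The re-ranking idea in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the function names (`predict_prompts`, `encode_image`, `rerank`), the two-layer ReLU MLP, the random untrained weights, and the mean-pooling stand-in for a ViT are all assumptions made for illustration. It only shows the data flow: a first-stage ranking picks the top-k candidates, the text embedding is mapped to a few prompt vectors, and those prompts are prepended to each candidate's patch tokens before the image is re-encoded and re-scored.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na > 0 and nb > 0 else 0.0

def predict_prompts(text_emb, w1, w2, num_prompts):
    """Hypothetical mapping network: text embedding -> num_prompts prompt vectors."""
    dim = len(text_emb)
    hidden = [max(0.0, dot(row, text_emb)) for row in w1]  # linear layer + ReLU
    flat = [dot(row, hidden) for row in w2]                # second linear layer
    return [flat[i * dim:(i + 1) * dim] for i in range(num_prompts)]

def encode_image(patch_tokens, prompts):
    """Stand-in for a ViT: mean-pool over the prompt tokens plus patch tokens.
    A real ViT would mix these tokens with self-attention instead."""
    tokens = prompts + patch_tokens
    dim = len(tokens[0])
    return [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]

def rerank(text_emb, gallery, first_scores, k, w1, w2, num_prompts):
    """Re-score only the top-k first-stage candidates with text-conditioned encoding."""
    order = sorted(range(len(gallery)), key=lambda i: first_scores[i], reverse=True)
    top, rest = order[:k], order[k:]
    prompts = predict_prompts(text_emb, w1, w2, num_prompts)
    top = sorted(
        top,
        key=lambda i: cosine(text_emb, encode_image(gallery[i], prompts)),
        reverse=True,
    )
    return top + rest  # candidates below the top-k keep their first-stage order

# Toy data: random, untrained weights and embeddings (assumptions, not real features).
random.seed(0)
DIM, HIDDEN, NUM_PROMPTS, K = 4, 8, 2, 3

def rand_vec(n):
    return [random.gauss(0, 1) for _ in range(n)]

w1 = [rand_vec(DIM) for _ in range(HIDDEN)]
w2 = [rand_vec(HIDDEN) for _ in range(NUM_PROMPTS * DIM)]
text_emb = rand_vec(DIM)
gallery = [[rand_vec(DIM) for _ in range(5)] for _ in range(6)]  # 6 images, 5 patches each

# First stage: text-agnostic image embeddings (no prompts).
first_scores = [cosine(text_emb, encode_image(patches, [])) for patches in gallery]
first_order = sorted(range(len(gallery)), key=lambda i: first_scores[i], reverse=True)
ranking = rerank(text_emb, gallery, first_scores, K, w1, w2, NUM_PROMPTS)
print("first stage:", first_order)
print("re-ranked:  ", ranking)
```

Because only the top-k candidates are re-encoded, the text-conditioned pass costs k image encodings per query rather than one per gallery image, which is why this kind of conditioning is used for re-ranking rather than for first-stage retrieval.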
Related papers
- LexLIP: Lexicon-bottlenecked Language-image Pre-training For Large-scale Image-text Retrieval (2023)
- HiVLP: Hierarchical Vision-language Pre-training For Fast Image-text Retrieval (2022)
- LLM2CLIP: Powerful Language Model Unlocks Richer Cross-modality Representation (2024)
- PriorCLIP: Visual Prior Guided Vision-language Model For Remote Sensing Image-text Retrieval (2024)
- Enhancing Image Retrieval: A Comprehensive Study On Photo Search Using The CLIP Mode (2024)
- CLIP-Lite: Information Efficient Visual Representation Learning With Language Supervision (2021)
- EvdCLIP: Improving Vision-language Retrieval With Entity Visual Descriptions From Large Language Models (2025)
- SILC: Improving Vision Language Pretraining With Self-distillation (2023)