Understanding Retrieval-augmented Task Adaptation For Vision-language Models
2024 · Yifei Ming, Yixuan Li
Abstract
Pre-trained contrastive vision-language models have demonstrated remarkable performance across a wide range of tasks. However, they often struggle on fine-grained datasets with categories not adequately represented during pre-training, which makes adaptation necessary. Recent works have shown promising results by utilizing samples from web-scale databases for retrieval-augmented adaptation, especially in low-data regimes. Despite the empirical success, understanding how retrieval impacts the adaptation of vision-language models remains an open research question. In this work, we adopt a reflective perspective and present a systematic study of the roles that key components play in retrieval-augmented adaptation. We unveil new insights on uni-modal and cross-modal retrieval and highlight the critical role of logit ensemble for effective adaptation. We further present theoretical underpinnings that directly support our empirical observations.
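The abstract does not spell out how the logit ensemble is computed, so the following is only a minimal sketch of one plausible reading: the zero-shot logits (query image against class text embeddings) are added to logits derived from retrieved per-class features. All function names, the mixing weight `alpha`, and the toy 2-D vectors are hypothetical illustrations, not the authors' implementation.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))


def zero_shot_logits(query, text_embeddings):
    """One logit per class: similarity of the query to each class text embedding."""
    return [cosine(query, t) for t in text_embeddings]


def retrieval_logits(query, retrieved_by_class):
    """One logit per class: mean similarity of the query to that class's retrieved features."""
    return [
        sum(cosine(query, f) for f in feats) / len(feats)
        for feats in retrieved_by_class
    ]


def ensemble_logits(zs, ret, alpha=1.0):
    """Weighted sum of both logit vectors; alpha is an assumed mixing weight."""
    return [z + alpha * r for z, r in zip(zs, ret)]


def predict(logits):
    """Index of the highest-scoring class."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Toy data (made up): two classes whose text embeddings are equally
# similar to the query, so the retrieved features break the tie.
query = [1.0, 0.0]
text_embeddings = [[1.0, 1.0], [1.0, -1.0]]
retrieved_by_class = [[[0.0, 1.0]], [[1.0, 0.1]]]

zs = zero_shot_logits(query, text_embeddings)
ret = retrieval_logits(query, retrieved_by_class)
ens = ensemble_logits(zs, ret, alpha=1.0)
print(predict(ens))
```

In this toy setup the text embeddings alone cannot separate the two classes, while the retrieved features can; the ensemble keeps both signals in a single score. How the retrieved features are obtained (uni-modal or cross-modal retrieval) is left outside the sketch.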
Related papers
- UCDR-Adapter: Exploring Adaptation Of Pre-trained Vision-language Models For Universal Cross-domain Retrieval (2024)
- QueryAdapter: Rapid Adaptation Of Vision-language Models In Response To Natural Language Queries (2025)
- RAVEN: Multitask Retrieval Augmented Vision-language Learning (2024)
- Infusing Fine-grained Visual Knowledge To Vision-language Models (2025)
- Cross-modal Adapter: Parameter-efficient Transfer Learning Approach For Vision-language Models (2024)
- Adapting Dual-encoder Vision-language Models For Paraphrased Retrieval (2024)
- A Comprehensive Empirical Study Of Vision-language Pre-trained Model For Supervised Cross-modal Retrieval (2022)
- C3: Continued Pretraining With Contrastive Weak Supervision For Cross Language Ad-hoc Retrieval (2022)