ProbVLM: Probabilistic Adapter for Frozen Vision-Language Models
2023 · Uddeshya Upadhyay, Shyamgopal Karthik, Massimiliano Mancini, et al.
Abstract
Large-scale vision-language models (VLMs) like CLIP successfully find correspondences between images and text. Through the standard deterministic mapping process, an image or a text sample is mapped to a single vector in the embedding space. This is problematic: as multiple samples (images or text) can abstract the same concept in the physical world, deterministic embeddings do not reflect the inherent ambiguity in the embedding space. We propose ProbVLM, a probabilistic adapter that estimates probability distributions for the embeddings of pre-trained VLMs via inter/intra-modal alignment in a post-hoc manner without needing large-scale datasets or computing. On four challenging datasets, i.e., COCO, Flickr, CUB, and Oxford-flowers, we estimate the multi-modal embedding uncertainties for two VLMs, i.e., CLIP and BLIP, quantify the calibration of embedding uncertainties in retrieval tasks and show that ProbVLM outperforms other methods. Furthermore, we propose active learning and model
Related papers
- Medprobclip: Probabilistic Adaptation Of Vision-language Foundation Model For Reliable Radiograph-report Retrieval (2026)
- Clip-vip: Adapting Pre-trained Image-text Model To Video-language Representation Alignment (2022)
- Queryadapter: Rapid Adaptation Of Vision-language Models In Response To Natural Language Queries (2025)
- Koo-fu CLIP: Closed-form Adaptation Of Vision-language Models Via Fukunaga-koontz Linear Discriminant Analysis (2026)
- Hyperdimensional Cross-modal Alignment Of Frozen Language And Image Models For Efficient Image Captioning (2026)
- Uclip: Parameter-efficient Multilingual Extension Of Vision-language Models With Unpaired Data (2025)
- Finetuning CLIP To Reason About Pairwise Differences (2024)
- Linear Alignment Of Vision-language Models For Image Captioning (2023)