Leveraging Data To Say No: Memory Augmented Plug-and-play Selective Prediction
2026 · Aditya Sarkar, Yi Li, Jiacheng Cheng, et al.
Abstract
Selective prediction aims to endow predictors with a reject option, so that they can abstain from low-confidence predictions. However, the existing literature has focused primarily on closed-set tasks, such as visual question answering with predefined options or fixed-category classification. This paper considers selective prediction for visual-language foundation models, addressing a taxonomy of tasks ranging from closed to open set and from finite to unbounded vocabularies, as in image captioning. We seek training-free, low-complexity approaches that are applicable to any foundation model, and consider methods based on external vision-language model embeddings, such as CLIP. This is denoted as Plug-and-Play Selective Prediction (PaPSP). We identify two key challenges: (1) instability of the visual-language representations, leading to high variance in image-text embeddings, and (2) poor calibration of similarity scores. To address these issues, we propose a memory-augmented PaPSP (MA-PaPSP) model, which augments PaPSP
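As an illustration of the plug-and-play idea described in the abstract, the following is a minimal sketch of a similarity-based reject option. It is not the authors' implementation: the function names, the fixed threshold, and the use of plain cosine similarity are assumptions made for illustration, and the embeddings are stand-in vectors rather than outputs of a real CLIP encoder. The memory augmentation of MA-PaPSP is not shown.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def selective_predict(image_emb, text_emb, prediction, threshold=0.5):
    """Return (prediction, score), or (None, score) when rejecting.

    image_emb: embedding of the input image from an external model
               (hypothetical stand-in for a CLIP image embedding).
    text_emb:  embedding of the model's textual prediction
               (hypothetical stand-in for a CLIP text embedding).
    threshold: illustrative cut-off, an assumption; in practice it
               would be tuned on held-out data.
    """
    score = cosine_similarity(image_emb, text_emb)
    if score >= threshold:
        return prediction, score  # accept the prediction
    return None, score            # reject: similarity too low


if __name__ == "__main__":
    # Toy vectors: the first pair is aligned, the second is orthogonal.
    print(selective_predict([1.0, 0.0], [1.0, 0.0], "a dog on grass"))
    print(selective_predict([1.0, 0.0], [0.0, 1.0], "a dog on grass"))
```

In this sketch the predictor itself is never modified; only its output is scored by an external embedding model, which is what makes the scheme training-free. The two challenges named in the abstract apply directly to the `score` value: it can vary strongly across image-text pairs, and a raw similarity is not a calibrated confidence.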
Related papers
- mPLUG: Effective And Efficient Vision-language Learning By Cross-modal Skip-connections (2022) · 16.14
- PriorCLIP: Visual Prior Guided Vision-language Model For Remote Sensing Image-text Retrieval (2024) · 0.00
- VL-JEPA: Joint Embedding Predictive Architecture For Vision-language (2025) · 1.99
- Context-adaptive Multi-prompt Embedding With Large Language Models For Vision-language Alignment (2025) · 0.00
- Understanding Retrieval-augmented Task Adaptation For Vision-language Models (2024) · 0.00
- ProbVLM: Probabilistic Adapter For Frozen Vision-language Models (2023) · 13.41
- Seeing What Matters: Empowering CLIP With Patch Generation-to-selection (2025) · 5.24
- Learning Customized Visual Models With Retrieval-augmented Knowledge (2023) · 11.58