Visual Adaptive Prompting For Compositional Zero-shot Learning
2025 · Kyle Stein, Arash Mahyari, Guillermo Francia, et al.
Abstract
Vision-Language Models (VLMs) have demonstrated impressive multimodal capabilities in learning joint representations of visual and textual data, making them powerful tools for tasks such as Compositional Zero-Shot Learning (CZSL). CZSL requires models to generalize to novel combinations of visual primitives--such as attributes and objects--that were not explicitly encountered during training. Recent works in prompting for CZSL have focused on modifying inputs for the text encoder, often using static prompts that do not change across varying visual contexts. However, these approaches struggle to fully capture varying visual contexts, as they focus on text adaptation rather than leveraging visual features for compositional reasoning. To address this, we propose a Visual Adaptive Prompting System (VAPS) that leverages a learnable visual prompt repository and similarity-based retrieval mechanism within the framework of VLMs to bridge the gap between semantic and visual features. Our method
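The similarity-based retrieval over a learnable visual prompt repository mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give implementation details, so everything here is an assumption made for illustration: the `PromptRepository` class, the pairing of each prompt with a key vector, the use of cosine similarity, the `top_k` parameter, and the random initialization (standing in for parameters that would be learned during training).

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)  # epsilon guards against zero vectors


class PromptRepository:
    """Hypothetical repository: each entry pairs a key vector with a prompt vector.

    In a trained system both would be learnable parameters; here they are
    randomly initialized (an assumption for illustration only).
    """

    def __init__(self, num_prompts, dim, seed=0):
        rng = random.Random(seed)
        self.keys = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_prompts)]
        self.prompts = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_prompts)]

    def retrieve(self, image_feature, top_k=2):
        """Return indices and prompts of the top_k keys most similar to image_feature."""
        scores = [cosine(image_feature, key) for key in self.keys]
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        chosen = ranked[:top_k]
        return chosen, [self.prompts[i] for i in chosen]


repo = PromptRepository(num_prompts=4, dim=3)
# Querying with a stored key should retrieve that key's own prompt first.
indices, selected = repo.retrieve(repo.keys[2], top_k=2)
```

In this sketch the image feature selects which prompts are used, so the prompt set changes with the visual input rather than staying fixed, which is the contrast with static text prompts that the abstract draws.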
Related papers
- Pitl: Cross-modal Retrieval With Weakly-supervised Vision-language Pre-training Via Prompting (2023) 7.16
- Elevating All Zero-shot Sketch-based Image Retrieval Through Multimodal Prompt Learning (2024) 6.34
- PDV: Prompt Directional Vectors For Zero-shot Composed Image Retrieval (2025) 0.00
- Context-adaptive Multi-prompt Embedding With Large Language Models For Vision-language Alignment (2025) 0.00
- Prompthub: Enhancing Multi-prompt Visual In-context Learning With Locality-aware Fusion, Concentration And Alignment (2026) 2.20
- Leveraging Large Vision-language Model As User Intent-aware Encoder For Composed Image Retrieval (2024) 3.58
- Leveraging Retrieval-augmented Tags For Large Vision-language Understanding In Complex Scenes (2024) 0.00
- Learning Visual Composition Through Improved Semantic Guidance (2024) 0.00