PromptHub: Enhancing Multi-prompt Visual In-context Learning With Locality-aware Fusion, Concentration And Alignment
2026 · Tianci Luo, Jinpeng Wang, Shiyu Qin, et al.
Abstract
Visual In-Context Learning (VICL) aims to complete vision tasks by imitating pixel demonstrations. Recent work pioneered prompt fusion that combines the advantages of various demonstrations, which shows a promising way to extend VICL. Unfortunately, the patch-wise fusion framework and model-agnostic supervision hinder the exploitation of informative cues, thereby limiting performance gains. To overcome this deficiency, we introduce PromptHub, a framework that holistically strengthens multi-prompting through locality-aware fusion, concentration and alignment. PromptHub exploits spatial priors to capture richer contextual information, employs complementary concentration, alignment, and prediction objectives to mutually guide training, and incorporates data augmentation to further reinforce supervision. Extensive experiments on three fundamental vision tasks demonstrate the superiority of PromptHub. Moreover, we validate its universality, transferability, and robustness across out-of-dist