Prompt-guided Attention Head Selection For Focus-oriented Image Retrieval
2025 · Yuji Nozawa, Yu-Chieh Lin, Kazumoto Nakamura, et al.
Abstract
The goal of this paper is to enhance pretrained Vision Transformer (ViT) models for focus-oriented image retrieval with visual prompting. In real-world image retrieval scenarios, both query and database images are often complex, with multiple objects and intricate backgrounds. Users often want to retrieve images containing a specific object, which we define as the Focus-Oriented Image Retrieval (FOIR) task. While a standard image encoder can be employed to extract image features for similarity matching, it may not perform optimally in the multi-object FOIR task, because each image is represented by a single global feature vector. To overcome this, a prompt-based image retrieval solution is required. We propose an approach called Prompt-guided attention Head Selection (PHS) to leverage the head-wise potential of the multi-head attention mechanism in ViT in a promptable manner. PHS selects specific attention heads by matching their attention maps with the user's visual prompts,
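The head-matching idea described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the abstract does not say how attention maps are matched against a prompt, so the overlap score used here (the share of a head's attention mass that falls inside the prompted region), the `top_k` cutoff, and the names `select_heads`, `attn_maps`, and `prompt_mask` are all assumptions made for the example.

```python
import numpy as np


def select_heads(attn_maps, prompt_mask, top_k=3):
    """Rank attention heads by how much of their attention lies inside a prompt.

    Hypothetical helper, not taken from the paper.

    attn_maps:   (num_heads, num_patches) non-negative attention weights,
                 e.g. one row of [CLS]-to-patch attention per head.
    prompt_mask: (num_patches,) booleans, True for patches inside the
                 user's visual prompt (e.g. a box or a mask).
    Returns (indices of the top_k heads, per-head scores).
    """
    attn = np.asarray(attn_maps, dtype=float)
    mask = np.asarray(prompt_mask, dtype=bool)
    if attn.ndim != 2 or mask.ndim != 1 or attn.shape[1] != mask.shape[0]:
        raise ValueError("expected attn_maps (H, N) and prompt_mask (N,)")

    totals = attn.sum(axis=1)            # total attention mass per head
    inside = attn[:, mask].sum(axis=1)   # mass that lands inside the prompt
    # Assumed matching score: fraction of mass inside the prompt region.
    # Heads with zero total mass get a score of 0 instead of dividing by zero.
    scores = np.divide(inside, totals, out=np.zeros_like(inside), where=totals > 0)

    order = np.argsort(-scores, kind="stable")  # highest score first
    return order[:top_k], scores


if __name__ == "__main__":
    # Toy input: 3 heads over 4 patches; the prompt covers the first two patches.
    toy_attn = [
        [0.7, 0.1, 0.1, 0.1],
        [0.1, 0.1, 0.4, 0.4],
        [0.25, 0.25, 0.25, 0.25],
    ]
    toy_mask = [True, True, False, False]
    heads, scores = select_heads(toy_attn, toy_mask, top_k=2)
    print(heads, scores)
```

In a real pipeline the attention maps would come from a chosen layer of the ViT and the mask from the user's prompt resized to the patch grid. How the selected heads are then used to build the retrieval feature is not covered by this sketch.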
Related papers
- Fine-grained Retrieval Prompt Tuning (2022) 10.07
- Focus, Distinguish, And Prompt: Unleashing CLIP For Efficient And Flexible Scene Text Retrieval (2024) 8.80
- Find Your Needle: Small Object Image Retrieval Via Multi-object Attention Optimization (2025) 0.00
- DVF: Advancing Robust And Accurate Fine-grained Image Retrieval With Retrieval Guidelines (2024) 9.03
- Revisiting Human-in-the-loop Object Retrieval With Pre-trained Vision Transformers (2026) 0.00
- VoP: Text-video Co-operative Prompt Tuning For Cross-modal Retrieval (2022) 16.41
- Highlighting What Matters: Promptable Embeddings For Attribute-focused Image Retrieval (2025) 0.00
- PriorCLIP: Visual Prior Guided Vision-language Model For Remote Sensing Image-text Retrieval (2024) 0.00