PriorCLIP: Visual Prior Guided Vision-Language Model for Remote Sensing Image-Text Retrieval
2024 · Jiancheng Pan, Muyuan Ma, Qing Ma, et al.
Abstract
Remote sensing image-text retrieval plays a crucial role in remote sensing interpretation, yet remains challenging under both closed-domain and open-domain scenarios due to semantic noise and domain shifts. To address these issues, we propose a visual prior-guided vision-language model, PriorCLIP, which leverages visual priors for unbiased representation learning and adaptive vision-language alignment. In the closed-domain setting, PriorCLIP introduces two Progressive Attention Encoder (PAE) structures: Spatial-PAE constructs a belief matrix with instruction embeddings to filter key features and mitigate semantic bias, while Temporal-PAE exploits cyclic activation across time steps to enhance text representation. For the open-domain setting, we design a two-stage prior representation learning strategy, consisting of large-scale pre-training on coarse-grained image-text pairs, followed by fine-tuning on fine-grained pairs using vision-instruction, which enables robust retrie
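For readers unfamiliar with the retrieval task the abstract refers to, the following is a minimal, generic sketch of text-to-image retrieval by cosine similarity, scored with Recall@K. It is not the PriorCLIP architecture: the PAE modules, belief matrix, and training strategy are not reproduced here. The toy embeddings, the function names, and the assumption that each caption has exactly one matching image are all illustrative.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_similarity_matrix(text_embs, image_embs):
    """Return sim[i][j] = cosine similarity between text i and image j."""
    texts = [l2_normalize(t) for t in text_embs]
    images = [l2_normalize(v) for v in image_embs]
    return [[sum(a * b for a, b in zip(t, v)) for v in images] for t in texts]


def recall_at_k(sim, ground_truth, k):
    """Fraction of text queries whose matching image is ranked in the top k."""
    hits = 0
    for row, target in zip(sim, ground_truth):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(sim)


# Hypothetical toy embeddings: in practice these would come from the text
# and image encoders of a vision-language model.
text_embs = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]
image_embs = [[0.9, 0.0, 0.1], [0.1, 0.8, 0.0], [0.7, 0.1, 0.6]]
ground_truth = [0, 1, 2]  # assumed: caption i describes image i

sim = cosine_similarity_matrix(text_embs, image_embs)
r1 = recall_at_k(sim, ground_truth, k=1)
print(f"R@1 = {r1:.2f}")
```

Image-to-text retrieval can be scored the same way by transposing the similarity matrix before ranking.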
Related papers
- DGTRSD & DGTRS-CLIP: A Dual-Granularity Remote Sensing Image-Text Dataset and Vision Language Foundation Model for Alignment (2025) · 2.98
- Enhancing Image Retrieval: A Comprehensive Study on Photo Search Using the CLIP Mode (2024) · 0.00
- CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment (2022) · 5.42
- ELIP: Enhanced Visual-Language Foundation Models for Image Retrieval (2025) · 2.26
- Advancing Myopia to Holism: Fully Contrastive Language-Image Pre-training (2024) · 0.00
- A Comprehensive Empirical Study of Vision-Language Pre-trained Model for Supervised Cross-Modal Retrieval (2022) · 0.00
- Large Language Models for Captioning and Retrieving Remote Sensing Images (2024) · 0.00
- Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval (2023) · 11.93