Redundancy-aware Pretraining Of Vision-language Foundation Models In Remote Sensing
2025 · Mathis Jürgen Adler, Leonard Hackel, Gencer Sumbul, et al.
Abstract
The development of foundation models through pretraining of vision-language models (VLMs) has recently attracted great attention in remote sensing (RS). VLM pretraining aims to learn image and language alignments from a large number of image-text pairs. Each pretraining image is often associated with multiple captions containing redundant information due to repeated or semantically similar phrases, resulting in increased pretraining and inference time. To overcome this, we introduce a weighted feature aggregation (WFA) strategy for VLM pretraining in RS. Our strategy aims to extract and exploit complementary information from multiple captions per image while reducing redundancies through feature aggregation with importance weighting. To calculate adaptive importance weights for different captions of each image, we propose two techniques: (i) non-parametric uniqueness and (ii) learning-based attention. In the first technique, importance weights are calculated based on the bilingual eval
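The aggregation idea described in the abstract can be sketched as follows: several caption feature vectors for one image are merged into a single vector using normalized importance weights. This is a minimal illustration and not the authors' implementation. The function names, the softmax normalization, and the raw per-caption scores are all assumptions. How the scores themselves are produced (the uniqueness-based or attention-based techniques named in the abstract) is not shown.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_caption_features(
    features: Sequence[Sequence[float]],
    scores: Sequence[float],
) -> List[float]:
    """Hypothetical helper: weighted sum of per-caption feature vectors.

    features: one feature vector per caption of the same image.
    scores:   one raw importance score per caption (assumed to be given).
    """
    if not features or len(features) != len(scores):
        raise ValueError("need one score per caption feature vector")
    dim = len(features[0])
    if any(len(f) != dim for f in features):
        raise ValueError("all feature vectors must have the same length")
    weights = softmax(scores)
    # One aggregated vector replaces the K per-caption vectors.
    return [
        sum(w * f[d] for w, f in zip(weights, features))
        for d in range(dim)
    ]
```

With equal scores the result reduces to a plain average of the caption features. A caption with a higher score contributes more to the aggregated vector.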
Related papers
- A Recipe For Improving Remote Sensing VLM Zero Shot Generalization (2025)
- Large Language Models For Captioning And Retrieving Remote Sensing Images (2024)
- Remote Sensing Retrieval-augmented Generation: Bridging Remote Sensing Imagery And Comprehensive Knowledge With A Multi-modal Dataset And Retrieval-augmented Generation Model (2025)
- Learning By Hallucinating: Vision-language Pre-training With Weak Supervision (2022)
- Unsupervised Vision-and-language Pre-training Via Retrieval-based Multi-granular Alignment (2022)
- Vision-language Modelling For Radiological Imaging And Reports In The Low Data Regime (2023)
- VLM2GeoVec: Toward Universal Multimodal Embeddings For Remote Sensing (2025)
- MLLMs-augmented Visual-language Representation Learning (2023)