Remote Sensing Retrieval-augmented Generation: Bridging Remote Sensing Imagery And Comprehensive Knowledge With A Multi-modal Dataset And Retrieval-augmented Generation Model
2025 · Congcong Wen, Yiting Lin, Xiaokang Qu, et al.
Abstract
Recent progress in vision-language models (VLMs) has demonstrated impressive capabilities across a variety of tasks in the natural image domain. Motivated by these advancements, the remote sensing community has begun to adopt VLMs for remote sensing vision-language tasks, including scene understanding, image captioning, and visual question answering. However, existing remote sensing VLMs typically rely on closed-set scene understanding and focus on generic scene descriptions, yet lack the ability to incorporate external knowledge. This limitation hinders their capacity for semantic reasoning over complex or context-dependent queries that involve domain-specific or world knowledge. To address these challenges, we first introduce a multimodal Remote Sensing World Knowledge (RSWK) dataset, which comprises high-resolution satellite imagery and detailed textual descriptions for 14,141 well-known landmarks from 175 countries, integrating both remote sensing domain knowledge and broader world knowledge. Building upo
Related papers
- VLM2GeoVec: Toward Universal Multimodal Embeddings For Remote Sensing (2025)
- A Recipe For Improving Remote Sensing VLM Zero Shot Generalization (2025)
- Large Language Models For Captioning And Retrieving Remote Sensing Images (2024)
- RAVEN: Multitask Retrieval Augmented Vision-language Learning (2024)
- Object Retrieval For Visual Question Answering With Outside Knowledge (2024)
- Open-world 3D Scene Graph Generation For Retrieval-augmented Reasoning (2025)
- Towards A Multimodal Framework For Remote Sensing Image Change Retrieval And Captioning (2024)
- OMGM: Orchestrate Multiple Granularities And Modalities For Efficient Multimodal Retrieval (2025)