Large Language Models For Captioning And Retrieving Remote Sensing Images
2024 · João Daniel Silva, João Magalhães, Devis Tuia, et al.
Abstract
Image captioning and cross-modal retrieval are examples of tasks that involve the joint analysis of visual and linguistic information. In connection to remote sensing imagery, these tasks can help non-expert users in extracting relevant Earth observation information for a variety of applications. Still, despite some previous efforts, the development and application of vision and language models to the remote sensing domain have been hindered by the relatively small size of the available datasets and models used in previous studies. In this work, we propose RS-CapRet, a Vision and Language method for remote sensing tasks, in particular image captioning and text-image retrieval. We specifically propose to use a highly capable large decoder language model together with image encoders adapted to remote sensing imagery through contrastive language-image pre-training. To bridge together the image encoder and language decoder, we propose training simple linear layers with examples from combin
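The abstract describes bridging the image encoder and the language decoder with simple linear layers. The following is a minimal sketch of one way such a bridge could look, not the authors' implementation: a single linear layer maps each image embedding to a few "prefix" vectors in the decoder's input embedding space, and those vectors are prepended to the caption token embeddings. All names (`init_linear`, `project_image_to_prefix`, `build_decoder_inputs`), the dimensions, and the use of several prefix tokens per image are assumptions made for illustration. Random arrays stand in for the outputs of a real image encoder and a real embedding table.

```python
import numpy as np

def init_linear(rng, d_in, d_out):
    """Return (W, b) for a linear layer: small random weights, zero bias."""
    W = rng.normal(0.0, 0.02, size=(d_in, d_out))
    b = np.zeros(d_out)
    return W, b

def project_image_to_prefix(image_emb, W, b, n_prefix, d_lm):
    """Map image embeddings (batch, d_img) to (batch, n_prefix, d_lm) prefix vectors."""
    flat = image_emb @ W + b  # shape: (batch, n_prefix * d_lm)
    return flat.reshape(image_emb.shape[0], n_prefix, d_lm)

def build_decoder_inputs(prefix, caption_token_emb):
    """Prepend the visual prefix to the caption token embeddings along the sequence axis."""
    return np.concatenate([prefix, caption_token_emb], axis=1)

# Hypothetical sizes, chosen only to keep the example small.
rng = np.random.default_rng(0)
batch, d_img, d_lm, n_prefix, n_text = 2, 8, 16, 4, 5

W, b = init_linear(rng, d_img, n_prefix * d_lm)

# Stand-ins for a frozen image encoder output and caption token embeddings.
image_emb = rng.normal(size=(batch, d_img))
caption_emb = rng.normal(size=(batch, n_text, d_lm))

prefix = project_image_to_prefix(image_emb, W, b, n_prefix, d_lm)
decoder_inputs = build_decoder_inputs(prefix, caption_emb)
print(decoder_inputs.shape)  # (2, 9, 16)
```

In a setup like this, only `W` and `b` would be trained while the encoder and decoder stay frozen, which keeps the number of learned parameters small.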
Related papers
- DGTRSD & DGTRS-CLIP: A Dual-granularity Remote Sensing Image-text Dataset And Vision Language Foundation Model For Alignment (2025)
- Towards A Multimodal Framework For Remote Sensing Image Change Retrieval And Captioning (2024)
- Remote Sensing Retrieval-augmented Generation: Bridging Remote Sensing Imagery And Comprehensive Knowledge With A Multi-modal Dataset And Retrieval-augmented Generation Model (2025)
- VLM2GeoVec: Toward Universal Multimodal Embeddings For Remote Sensing (2025)
- Composed Image Retrieval For Remote Sensing (2024)
- PriorCLIP: Visual Prior Guided Vision-language Model For Remote Sensing Image-text Retrieval (2024)
- Linear Alignment Of Vision-language Models For Image Captioning (2023)
- Towards Retrieval-augmented Architectures For Image Captioning (2024)