DGTRSD & DGTRS-CLIP: A Dual-granularity Remote Sensing Image-text Dataset And Vision Language Foundation Model For Alignment
2025 Β· Weizhi Chen, Yupeng Deng, Jin Wei, et al.
Abstract
Vision Language Foundation Models based on the CLIP architecture for remote sensing primarily rely on short text captions, which often result in incomplete semantic representations. Although longer captions convey richer information, existing models struggle to process them effectively because of limited text-encoding capacity, and there remains a shortage of resources that align remote sensing images with both short text and long text captions. To address this gap, we introduce DGTRSD, a dual-granularity remote sensing image-text dataset in which each image is paired with both a short text caption and a long text description, providing a solid foundation for dual-granularity semantic modeling. Building on this dataset, we further propose DGTRS-CLIP, a dual-granularity curriculum learning framework that combines short text and long text supervision to achieve dual-granularity semantic alignment. Extensive experiments on four typical zero-shot tasks: long text cross-modal retrieval, short text cross-mod
Related papers
- FG-CLIP: Fine-grained Visual And Textual Alignment (2025)
- Large Language Models For Captioning And Retrieving Remote Sensing Images (2024)
- Linear Alignment Of Vision-language Models For Image Captioning (2023)
- PriorCLIP: Visual Prior Guided Vision-language Model For Remote Sensing Image-text Retrieval (2024)
- GOAL: Global-local Object Alignment Learning (2025)
- ScenarioCLIP: Pretrained Transferable Visual Language Models And Action-genome Dataset For Natural Scene Analysis (2025)
- FLAIR: VLM With Fine-grained Language-informed Image Representations (2024)
- Robust Cross-modal Representation Learning With Progressive Self-distillation (2022)