Transcending Fusion: A Multi-scale Alignment Method For Remote Sensing Image-text Retrieval
2024 · Rui Yang, Shuang Wang, Yingping Han, et al.
Abstract
Remote Sensing Image-Text Retrieval (RSITR) is pivotal for knowledge services and data mining in the remote sensing (RS) domain. Accounting for multi-scale representations in image content and text vocabulary lets models learn richer representations and improves retrieval. Current multi-scale RSITR approaches typically align multi-scale fused image features with text features, but overlook aligning image-text pairs at each scale separately. This oversight restricts their ability to learn joint representations suitable for effective retrieval. We introduce a novel Multi-Scale Alignment (MSA) method to overcome this limitation. Our method comprises three key innovations: (1) a Multi-scale Cross-Modal Alignment Transformer (MSCMAT), which computes cross-attention between single-scale image features and localized text features, integrating global textual context to derive a matching score matrix within a mini-batch, (2) a multi-scale cross-modal semantic alignment loss that
Related papers
- Exploring A Fine-grained Multiscale Method For Cross-modal Remote Sensing Image Retrieval (2022)
- Remote Sensing Cross-modal Text-image Retrieval Based On Global And Local Information (2022)
- Robust Remote Sensing Image-text Retrieval With Noisy Correspondence (2026)
- Fast-then-fine: A Two-stage Framework With Multi-granular Representation For Cross-modal Retrieval In Remote Sensing (2026)
- Scale-semantic Joint Decoupling Network For Image-text Retrieval In Remote Sensing (2022)
- Iebaker: Improved Remote Sensing Image-text Retrieval Framework Via Eliminate Before Align And Keyword Explicit Reasoning (2025)
- Towards A Multimodal Framework For Remote Sensing Image Change Retrieval And Captioning (2024)
- Self-supervised Cross-modal Text-image Time Series Retrieval In Remote Sensing (2025)