Fast-then-fine: A Two-stage Framework With Multi-granular Representation For Cross-modal Retrieval In Remote Sensing
2026 · Xi Chen, Xu Chen, Xiangyang Jia, et al.
Abstract
Remote sensing (RS) image-text retrieval plays a critical role in understanding massive RS imagery. However, the dense multi-object distribution and complex backgrounds in RS imagery make it difficult to simultaneously achieve fine-grained cross-modal alignment and efficient retrieval. Existing methods either rely on complex cross-modal interactions that lead to low retrieval efficiency, or depend on large-scale vision-language model pre-training, which requires massive data and computational resources. To address these issues, we propose a fast-then-fine (FTF) two-stage retrieval framework that decomposes retrieval into a text-agnostic recall stage for efficient candidate selection and a text-guided rerank stage for fine-grained alignment. Specifically, in the recall stage, text-agnostic coarse-grained representations are employed for efficient candidate selection; in the rerank stage, a parameter-free balanced text-guided interaction block enhances fine-grained alignment without intr
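The recall-then-rerank decomposition described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: every function name here is hypothetical, the embeddings are assumed to be precomputed plain lists of floats, and the rerank score is a generic parameter-free softmax attention used as a stand-in for the paper's balanced text-guided interaction block, whose details are not given in the text above.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0


def recall(query_vec, image_vecs, k):
    """Stage 1 (fast): rank every image by cosine similarity between the
    global query vector and its global image vector, then keep the top k.
    The image vectors do not depend on the query, so they can be cached."""
    order = sorted(range(len(image_vecs)),
                   key=lambda i: cosine(query_vec, image_vecs[i]),
                   reverse=True)
    return order[:k]


def text_guided_score(word_vecs, region_vecs, temperature=0.1):
    """Stage 2 (fine), assumed stand-in: each word attends over the image's
    region vectors with a softmax (no learned weights), and the score is the
    mean cosine between each word and its attended region vector."""
    total = 0.0
    for w in word_vecs:
        sims = [cosine(w, r) for r in region_vecs]
        top = max(sims)
        weights = [math.exp((s - top) / temperature) for s in sims]
        z = sum(weights)
        attended = [sum(wt * r[d] for wt, r in zip(weights, region_vecs)) / z
                    for d in range(len(w))]
        total += cosine(w, attended)
    return total / len(word_vecs)


def fast_then_fine(query_vec, word_vecs, image_vecs, image_regions, k):
    """Recall k candidates cheaply, then reorder only those candidates
    with the more expensive text-guided score."""
    candidates = recall(query_vec, image_vecs, k)
    return sorted(candidates,
                  key=lambda i: text_guided_score(word_vecs, image_regions[i]),
                  reverse=True)
```

The point of the split is cost: the fine-grained score runs on only `k` candidates rather than on the whole gallery, while the recall stage needs just one similarity per image against cached vectors.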
Related papers
- Exploring A Fine-grained Multiscale Method For Cross-modal Remote Sensing Image Retrieval (2022) · 16.73
- Remote Sensing Cross-modal Text-image Retrieval Based On Global And Local Information (2022) · 19.48
- Transcending Fusion: A Multi-scale Alignment Method For Remote Sensing Image-text Retrieval (2024) · 11.92
- Retrieve Fast, Rerank Smart: Cooperative And Joint Approaches For Improved Cross-modal Retrieval (2021) · 10.97
- Multi-modal Reference Learning For Fine-grained Text-to-image Retrieval (2025) · 6.77
- Coarse-to-fine: Learning Compact Discriminative Representation For Single-stage Image Retrieval (2023) · 9.35
- Towards A Multimodal Framework For Remote Sensing Image Change Retrieval And Captioning (2024) · 8.85
- Robust Remote Sensing Image-text Retrieval With Noisy Correspondence (2026) · 1.24