VLM2GeoVec: Toward Universal Multimodal Embeddings For Remote Sensing
2025 · Emanuel Sánchez Aimar, Gulnaz Zhambulova, Fahad Shahbaz Khan, et al.
Abstract
Satellite imagery differs fundamentally from natural images: its aerial viewpoint, very high resolution, diverse scale variations, and abundance of small objects demand both region-level spatial reasoning and holistic scene understanding. Current remote-sensing approaches remain fragmented between dual-encoder retrieval models, which excel at large-scale cross-modal search but cannot interleave modalities, and generative assistants, which support region-level interpretation but lack scalable retrieval capabilities. We propose VLM2GeoVec, an instruction-following, single-encoder vision-language model trained contrastively to embed interleaved inputs (images, text, bounding boxes, and geographic coordinates) in a unified vector space. Our single encoder interleaves all inputs into one joint embedding trained with a contrastive loss, eliminating multi-stage pipelines and task-specific modules. To evaluate its versatility, we introduce RSMEB, a novel benchmark
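The abstract states only that the encoder is trained with a contrastive loss; it does not give the exact objective. As a rough illustration of what such training typically optimizes, the sketch below computes an InfoNCE-style loss with in-batch negatives over already-computed embeddings. The choice of InfoNCE, the function names `l2_normalize` and `info_nce`, and the temperature value are assumptions made for illustration, not details taken from the paper.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(queries, targets, temperature=0.05):
    """Mean InfoNCE loss over a batch (illustrative, not the paper's code).

    targets[i] is treated as the positive for queries[i]; every other
    target in the batch acts as a negative. The temperature is an
    assumed value, not one reported in the source.
    """
    q = [l2_normalize(v) for v in queries]
    t = [l2_normalize(v) for v in targets]
    total = 0.0
    for i, qi in enumerate(q):
        # Temperature-scaled cosine similarity to every target in the batch.
        logits = [sum(a * b for a, b in zip(qi, tj)) / temperature for tj in t]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        # Negative log-probability assigned to the matching target.
        total += log_denom - logits[i]
    return total / len(q)


# Toy batch: two query embeddings and two target embeddings.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

In a single-encoder setup like the one described, both the query and target vectors would come from the same model applied to interleaved inputs; here they are plain lists so the loss itself stays easy to inspect. The loss is near zero when each query is closest to its own target and grows when pairs are mismatched.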
Related papers
- VLM2Vec-V2: Advancing Multimodal Embedding For Videos, Images, And Visual Documents (2025)
- Remote Sensing Retrieval-augmented Generation: Bridging Remote Sensing Imagery And Comprehensive Knowledge With A Multi-modal Dataset And Retrieval-augmented Generation Model (2025)
- VLM2Vec: Training Vision-language Models For Massive Multimodal Embedding Tasks (2024)
- Large Language Models For Captioning And Retrieving Remote Sensing Images (2024)
- A Recipe For Improving Remote Sensing VLM Zero Shot Generalization (2025)
- Direction-oriented Visual-semantic Embedding Model For Remote Sensing Image-text Retrieval (2023)
- Come-vl: Scaling Complementary Multi-encoder Vision-language Learning (2026)
- Meol: Training-free Instruction-guided Multimodal Embedder For Vector Graphics And Image Retrieval (2026)