AVION: Aerial Vision-language Instruction From Offline Teacher To Prompt-tuned Network
2026 · Yu Hu, Jianyang Gu, Hao Liu, et al.
Abstract
Adapting vision-language models to remote sensing imagery remains challenging due to two key factors: limited semantic coverage in textual representations and insufficient adaptability of visual features. These issues are particularly significant in aerial scenes, which involve various visual appearances and fine-grained object distinctions. We propose AVION, a knowledge distillation framework tailored for remote sensing adaptation of vision-language models. The teacher module constructs semantically rich textual prototypes by collecting descriptions from a large language model and verifying validity using remote sensing image features. The student module integrates lightweight and learnable prompts into both vision and language encoders, guided by the teacher to align embeddings and their cross-modal relationships. Once trained, the student operates independently during inference. Experiments on six optical remote sensing benchmarks show that AVION improves few-shot classification and
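The teacher/student scheme described in the abstract could be sketched roughly as follows. This is an illustrative NumPy sketch, not the authors' implementation: the function names (`build_prototype`, `distill_loss`), the top-k filtering rule used for "verifying validity", the cosine alignment term, the KL relation term, and the parameters `keep_ratio` and `tau` are all assumptions made for the example.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def build_prototype(desc_emb, class_img_emb, keep_ratio=0.5):
    """Hypothetical teacher step: build one class's text prototype.

    desc_emb:      (D, d) embeddings of LLM-written descriptions of the class
    class_img_emb: (M, d) image features of a few examples of the class
    Assumption: a description counts as "valid" if it ranks in the top
    keep_ratio by cosine similarity to the mean image feature.
    """
    desc = l2_normalize(desc_emb)
    img_mean = l2_normalize(class_img_emb.mean(axis=0))
    scores = desc @ img_mean                      # (D,) cosine similarities
    k = max(1, int(round(keep_ratio * len(desc))))
    keep = np.argsort(scores)[-k:]                # indices of the k best descriptions
    return l2_normalize(desc[keep].mean(axis=0))  # (d,) unit-norm prototype

def distill_loss(s_img, s_txt, t_img, t_txt, tau=0.07):
    """Hypothetical student objective: embedding + relation alignment.

    s_img, t_img: (N, d) student / teacher image embeddings
    s_txt, t_txt: (C, d) student / teacher class text embeddings
    """
    s_img, s_txt, t_img, t_txt = (l2_normalize(a) for a in (s_img, s_txt, t_img, t_txt))
    # Embedding alignment: mean (1 - cosine) between student and teacher vectors.
    align = np.mean(1.0 - np.sum(s_img * t_img, axis=-1)) \
          + np.mean(1.0 - np.sum(s_txt * t_txt, axis=-1))
    # Relation alignment: KL(teacher || student) over image-to-text similarities.
    p_t = softmax(t_img @ t_txt.T / tau)
    p_s = softmax(s_img @ s_txt.T / tau)
    relation = np.mean(np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=-1))
    return align + relation
```

In this sketch the teacher embeddings are only needed to compute the loss during training, which matches the abstract's statement that the student operates independently at inference.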
Related papers
- QueryAdapter: Rapid Adaptation Of Vision-language Models In Response To Natural Language Queries (2025)
- Large Language Models For Captioning And Retrieving Remote Sensing Images (2024)
- VLM2GeoVec: Toward Universal Multimodal Embeddings For Remote Sensing (2025)
- CAVL: Learning Contrastive And Adaptive Representations Of Vision And Language (2023)
- AMMKD: Adaptive Multimodal Multi-teacher Distillation For Lightweight Vision-language Models (2025)
- AVLnet: Learning Audio-visual Language Representations From Instructional Videos (2020)
- RAVEN: Multitask Retrieval Augmented Vision-language Learning (2024)
- PriorCLIP: Visual Prior Guided Vision-language Model For Remote Sensing Image-text Retrieval (2024)