NanoVDR: Distilling A 2B Vision-language Retriever Into A 70M Text-only Encoder For Visual Document Retrieval
2026 · Zhuchenyang Liu, Yao Zhang, Yu Xiao
Abstract
Vision-Language Model (VLM)-based retrievers have advanced visual document retrieval (VDR) to impressive quality. However, they require the same multi-billion-parameter encoder for both document indexing and query encoding, incurring high latency and GPU dependence even for plain-text queries. We observe that this design is unnecessarily symmetric: documents are visually complex and demand strong visual understanding, whereas queries are just short text strings. NanoVDR exploits this query-document asymmetry by decoupling the two encoding paths: a frozen 2B VLM teacher indexes documents offline, while a distilled text-only student as small as 69M parameters encodes queries at inference. The key design choice is the distillation objective. Through systematic comparison of six objectives across three backbones and 22 ViDoRe benchmark datasets, we find that pointwise cosine alignment on query text consistently outperforms ranking-based and contrastive alternatives, while requiring only pre-cached
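To make the asymmetric setup concrete, here is a minimal, dependency-free sketch of a pointwise cosine alignment loss and of query-time ranking against an offline document index. It is an illustration under stated assumptions, not the authors' implementation: it assumes the student's embedding of a query is pulled toward the teacher's embedding of the same query text, and all function names, vectors, and page ids (`cosine_alignment_loss`, `rank_documents`, `page_a`, and so on) are hypothetical toy values.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_alignment_loss(student_vecs, teacher_vecs):
    """Pointwise objective: mean of (1 - cosine) over paired embeddings.

    Each student embedding is compared only with the teacher embedding of
    the same query (assumption), so no negatives or rankings are needed.
    """
    losses = [1.0 - cosine(s, t) for s, t in zip(student_vecs, teacher_vecs)]
    return sum(losses) / len(losses)


def rank_documents(query_vec, doc_index):
    """Return document ids sorted by cosine similarity, best match first."""
    return sorted(
        doc_index,
        key=lambda doc_id: cosine(query_vec, doc_index[doc_id]),
        reverse=True,
    )


# Toy stand-ins for teacher query embeddings (hypothetical values).
teacher_query_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
# Toy stand-ins for the small text-only student's outputs on the same queries.
student_query_vecs = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
# Toy document index, standing in for embeddings the teacher produced offline.
doc_index = {"page_a": [1.0, 0.1, 0.0], "page_b": [0.0, 0.2, 1.0]}

loss = cosine_alignment_loss(student_query_vecs, teacher_query_vecs)
ranking = rank_documents(student_query_vecs[0], doc_index)
print(f"alignment loss: {loss:.4f}")
print("ranking for query 0:", ranking)
```

Because the student is trained to land in the teacher's embedding space, its query vectors can be scored directly against the teacher-built index, which is what lets the large model stay out of the query path in this sketch.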
Related papers
- SERVAL: Surprisingly Effective Zero-shot Visual Document Retrieval Powered By Large Vision And Language Models (2025)
- MURE: Hierarchical Multi-resolution Encoding Via Vision-language Models For Visual Document Retrieval (2026)
- ModernVBERT: Towards Smaller Visual Document Retrievers (2025)
- DocPruner: A Storage-efficient Framework For Multi-vector Visual Document Retrieval Via Adaptive Patch-level Embedding Pruning (2025)
- VLDeformer: Vision-language Decomposed Transformer For Fast Cross-modal Retrieval (2021)
- ColPali: Efficient Document Retrieval With Vision Language Models (2024)
- Evo-retriever: LLM-guided Curriculum Evolution With Viewpoint-pathway Collaboration For Multimodal Document Retrieval (2026)
- Sculpting The Vector Space: Towards Efficient Multi-vector Visual Document Retrieval Via Prune-then-merge Framework (2026)