SERVAL: Surprisingly Effective Zero-shot Visual Document Retrieval Powered By Large Vision And Language Models
2025 · Thong Nguyen, Yibin Lei, Jia-Huei Ju, et al.
Abstract
Visual Document Retrieval (VDR) typically operates as text-to-image retrieval using specialized bi-encoders trained to directly embed document images. We revisit a zero-shot generate-and-encode pipeline: a vision-language model first produces a detailed textual description of each document image, which is then embedded by a standard text encoder. On the ViDoRe-v2 benchmark, the method reaches 63.4% nDCG@5, surpassing the strongest specialized multi-vector visual document encoder. It also scales better to large collections and offers broader multilingual coverage. Analysis shows that modern vision-language models capture complex textual and visual cues with sufficient granularity to act as a reusable semantic proxy. By offloading modality alignment to pretrained vision-language models, our approach removes the need for computationally intensive text-image contrastive training and establishes a strong zero-shot baseline for future VDR systems.
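The generate-and-encode idea described in the abstract can be illustrated with a small, self-contained sketch. This is not the authors' implementation: `describe_document` is a hypothetical stand-in for the vision-language model (it only looks up a canned description), `embed_text` is a toy bag-of-words encoder standing in for a real text encoder, and `ndcg_at_k` uses linear gain, which may differ from the exact variant used on ViDoRe-v2. The page names and descriptions are invented for illustration.

```python
import math
import re
from collections import Counter


def describe_document(image_id, canned_descriptions):
    """Hypothetical stand-in for the VLM step: return a canned description."""
    return canned_descriptions[image_id]


def embed_text(text):
    """Toy text encoder (assumption): L2-normalised bag-of-words vector."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {tok: c / norm for tok, c in counts.items()} if norm else {}


def cosine(u, v):
    """Dot product of two already-normalised sparse vectors."""
    return sum(w * v.get(tok, 0.0) for tok, w in u.items())


def build_index(image_ids, canned_descriptions):
    """Offline step: describe every page once, then embed the description."""
    return {
        i: embed_text(describe_document(i, canned_descriptions))
        for i in image_ids
    }


def search(query, index, k=5):
    """Online step: embed the query as text and rank pages by similarity."""
    q = embed_text(query)
    ranked = sorted(index, key=lambda i: cosine(q, index[i]), reverse=True)
    return ranked[:k]


def ndcg_at_k(ranked_ids, relevance, k=5):
    """nDCG@k with linear gain; `relevance` maps document id to a grade."""
    dcg = sum(
        relevance.get(doc, 0) / math.log2(rank + 2)
        for rank, doc in enumerate(ranked_ids[:k])
    )
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg else 0.0


# Invented example pages and descriptions, for illustration only.
DESCRIPTIONS = {
    "page_1": "Bar chart of quarterly revenue by region with a table of totals",
    "page_2": "Safety instructions for installing a gas water heater",
    "page_3": "Line plot of annual rainfall and a paragraph on climate trends",
}

index = build_index(DESCRIPTIONS, DESCRIPTIONS)
hits = search("quarterly revenue chart", index, k=5)
score = ndcg_at_k(hits, {"page_1": 1}, k=5)
```

In a real system the two placeholder functions would be replaced by a vision-language model call and a dense text encoder. The structure stays the same: descriptions are generated and embedded once per page, and queries only ever touch the text encoder.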
Related papers
- ModernVBERT: Towards Smaller Visual Document Retrievers (2025) 0.00
- NanoVDR: Distilling A 2B Vision-language Retriever Into A 70M Text-only Encoder For Visual Document Retrieval (2026) 0.00
- ColPali: Efficient Document Retrieval With Vision Language Models (2024) 0.00
- MURE: Hierarchical Multi-resolution Encoding Via Vision-language Models For Visual Document Retrieval (2026) 0.00
- GlobalDoc: A Cross-modal Vision-language Framework For Real-world Document Image Retrieval And Classification (2023) 3.58
- Unlocking Multimodal Document Intelligence: From Current Triumphs To Future Frontiers Of Visual Document Retrieval (2026) 0.00
- Survey Of Visual-semantic Embedding Methods For Zero-shot Image Retrieval (2021) 4.52
- VLDeformer: Vision-language Decomposed Transformer For Fast Cross-modal Retrieval (2021) 10.21