ModernVBERT: Towards Smaller Visual Document Retrievers
2025 · Paul Teiletche, Quentin Macé, Max Conti, et al.
Abstract
Retrieving specific information from a large corpus of documents is a prevalent industrial use case of modern AI, notably due to the popularity of Retrieval-Augmented Generation (RAG) systems. Although neural document retrieval models have historically operated exclusively in the text space, Visual Document Retrieval (VDR) models (large vision-language decoders repurposed as embedding models that work directly with page screenshots as inputs) are increasingly popular due to the performance and indexing latency gains they offer. In this work, we show that, while cost-efficient, this approach of repurposing generative models bottlenecks retrieval performance. Through controlled experiments, we revisit the entire training pipeline and establish a principled recipe for improving visual document retrieval models. We notably measure the impact of attention masking, image resolution, modality alignment data regimes, and late-interaction-centered contrastive objectives, which emerge as cent
Related papers
- SERVAL: Surprisingly Effective Zero-shot Visual Document Retrieval Powered By Large Vision And Language Models (2025)
- Enhancing Document VQA Models Via Retrieval-augmented Generation (2025)
- ColPali: Efficient Document Retrieval With Vision Language Models (2024)
- Unlocking Multimodal Document Intelligence: From Current Triumphs To Future Frontiers Of Visual Document Retrieval (2026)
- GlobalDoc: A Cross-modal Vision-language Framework For Real-world Document Image Retrieval And Classification (2023)
- Reproducibility, Replicability, And Insights Into Visual Document Retrieval With Late Interaction (2025)
- VDocRAG: Retrieval-augmented Generation Over Visually-rich Documents (2025)
- Nemotron ColEmbed V2: Top-performing Late Interaction Embedding Models For Visual Document Retrieval (2026)