VDocRAG: Retrieval-augmented Generation Over Visually-rich Documents
2025 · Ryota Tanaka, Taichi Iki, Taku Hasegawa, et al.
Abstract
We aim to develop a retrieval-augmented generation (RAG) framework that answers questions over a corpus of visually-rich documents presented in mixed modalities (e.g., charts, tables) and diverse formats (e.g., PDF, PPTX). In this paper, we introduce a new RAG framework, VDocRAG, which can directly understand varied documents and modalities in a unified image format to prevent missing information that occurs by parsing documents to obtain text. To improve the performance, we propose novel self-supervised pre-training tasks that adapt large vision-language models for retrieval by compressing visual information into dense token representations while aligning them with textual content in documents. Furthermore, we introduce OpenDocVQA, the first unified collection of open-domain document visual question answering datasets, encompassing diverse document types and formats. OpenDocVQA provides a comprehensive resource for training and evaluating retrieval and question answering models on vis
Related papers
- Enhancing Document VQA Models Via Retrieval-augmented Generation (2025)
- Visual-rag: Benchmarking Text-to-image Retrieval Augmented Generation For Visual Knowledge Intensive Queries (2025)
- Simpledoc: Multi-modal Document Understanding With Dual-cue Page Retrieval And Iterative Refinement (2025)
- Regionrag: Region-level Retrieval-augmented Generation For Visual Document Understanding (2025)
- Unidoc-rl: Coarse-to-fine Visual RAG With Hierarchical Actions And Dense Rewards (2026)
- Modernvbert: Towards Smaller Visual Document Retrievers (2025)
- Visr-bench: An Empirical Study On Visual Retrieval-augmented Generation For Multilingual Long Document Understanding (2025)
- Document Haystacks: Vision-language Reasoning Over Piles Of 1000+ Documents (2024)