Index Light, Reason Deep: Deferred Visual Ingestion For Visual-dense Document Question Answering
2026 Β· Tao Xu
Abstract
Existing multimodal document question answering methods predominantly adopt a Pre-Ingestion (PI) strategy: during the indexing phase, a Vision Language Model (VLM) is called on every page to generate page descriptions that are then encoded into vectors, and questions are answered via embedding similarity retrieval. However, this approach faces a dual dilemma on visual-dense engineering documents: VLM blind descriptions inevitably lose critical visual details, and embedding retrieval systematically fails on highly similar documents. This paper proposes the Deferred Visual Ingestion (DVI) framework: zero VLM calls during preprocessing, leveraging only document structural information (table of contents, drawing numbers) to automatically build a hierarchical index through the HDNC (Hierarchical Drawing Number Clustering) algorithm; during inference, candidate pages are located via BM25 retrieval, and the original images along with the specific question are sent to a VLM for targeted analys
Authors
(none)
Tags
Stats
Related papers
- Simpledoc: Multi-modal Document Understanding With Dual-cue Page Retrieval And Iterative Refinement (2025)5.50
- Enhancing Document VQA Models Via Retrieval-augmented Generation (2025)0.00
- Document Haystacks: Vision-language Reasoning Over Piles Of 1000+ Documents (2024)2.83
- Developing Visual Augmented Q&A System Using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker (2025)0.00
- Fine-grained Late-interaction Multi-modal Retrieval For Retrieval Augmented Visual Question Answering (2023)5.24
- A Symmetric Dual Encoding Dense Retrieval Framework For Knowledge-intensive Visual Question Answering (2023)9.92
- HIVE: Query, Hypothesize, Verify An LLM Framework For Multimodal Reasoning-intensive Retrieval (2026)0.00
- Pre-training Multi-modal Dense Retrievers For Outside-knowledge Visual Question Answering (2023)7.50