AgenticOCR: Parsing Only What You Need For Efficient Retrieval-augmented Generation
2026 · Zhengren Wang, Dongsheng Ma, Huaping Zhong, et al.
Abstract
The expansion of retrieval-augmented generation (RAG) into multimodal domains has intensified the challenge of processing complex visual documents, such as financial reports. While page-level chunking and retrieval is a natural starting point, it creates a critical bottleneck: delivering entire pages to the generator introduces excessive extraneous context. This not only overloads the generator's attention mechanism but also dilutes the most salient evidence. Moreover, compressing these information-rich pages into a limited visual token budget further increases the risk of hallucinations. To address this, we introduce AgenticOCR, a dynamic parsing paradigm that transforms optical character recognition (OCR) from a static, full-text process into a query-driven, on-demand extraction system. By autonomously analyzing document layout in a "thinking with images" manner, AgenticOCR identifies and selectively recognizes regions of interest. This approach performs on-demand decompression of v
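The query-driven idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the `Region` type, the `hint` field, and the functions `select_regions` and `parse_on_demand` are hypothetical names, and simple word overlap stands in for the layout analysis that AgenticOCR is described as performing on the page image. The only point shown is that recognition runs on the selected regions rather than on the whole page.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class Region:
    """Hypothetical layout region on a page (all fields are assumptions)."""
    label: str
    bbox: tuple  # (x0, y0, x1, y1) in pixels
    hint: str    # cheap layout-level description, e.g. a caption or heading


def tokenize(text: str) -> set:
    """Lowercase word set; a crude stand-in for real query/region matching."""
    return set(re.findall(r"\w+", text.lower()))


def select_regions(query: str, regions: list, top_k: int = 1) -> list:
    """Keep the top_k regions whose hint shares at least one word with the query."""
    q = tokenize(query)
    scored = [(len(q & tokenize(r.hint)), r) for r in regions]
    scored = [(s, r) for s, r in scored if s > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [r for _, r in scored[:top_k]]


def parse_on_demand(query: str, regions: list,
                    recognize: Callable[[Region], str], top_k: int = 1) -> dict:
    """Run the (expensive) recognizer only on the selected regions."""
    return {r.label: recognize(r) for r in select_regions(query, regions, top_k)}


# Toy page with three regions; the recognizer is a stub that records its calls.
page = [
    Region("header", (0, 0, 800, 60), "annual report 2023"),
    Region("revenue_table", (0, 100, 800, 400), "table: revenue by segment"),
    Region("footnote", (0, 900, 800, 960), "notes on accounting policy"),
]
calls = []


def fake_ocr(region: Region) -> str:
    calls.append(region.label)
    return f"<text of {region.label}>"


result = parse_on_demand("What was revenue by segment?", page, fake_ocr)
```

In this toy run only the revenue table is passed to the recognizer; the header and footnote are never parsed, which is the cost saving the abstract attributes to on-demand extraction.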
Related papers
- Beyond Patch Aggregation: 3-pass Pyramid Indexing For Vision-enhanced Document Retrieval (2025) 0.00
- Enhancing Document VQA Models Via Retrieval-augmented Generation (2025) 0.00
- RegionRAG: Region-level Retrieval-augmented Generation For Visual Document Understanding (2025) 0.00
- Cross-modal RAG: Sub-dimensional Text-to-image Retrieval-augmented Generation (2025) 0.00
- VISOR: Agentic Visual Retrieval-augmented Generation Via Iterative Search And Over-horizon Reasoning (2026) 0.00
- xRAG: Extreme Context Compression For Retrieval-augmented Generation With One Token (2024) 7.81
- VDocRAG: Retrieval-augmented Generation Over Visually-rich Documents (2025) 6.34
- UniversalRAG: Retrieval-augmented Generation Over Corpora Of Diverse Modalities And Granularities (2025) 0.00