IRPAPERS: A Visual Document Benchmark For Scientific Retrieval And Question Answering
2026 · Connor Shorten, Augustas Skaburskas, Daniel M. Jones, et al.
Abstract
AI systems have achieved remarkable success in processing text and relational data, yet visual document processing remains relatively underexplored. Whereas traditional systems require OCR transcriptions to convert these visual documents into text and metadata, recent advances in multimodal foundation models offer retrieval and generation directly from document images. This raises a key question: How do image-based systems compare to established text-based methods? We introduce IRPAPERS, a benchmark of 3,230 pages from 166 scientific papers, with both an image and an OCR transcription for each page. Using 180 needle-in-the-haystack questions, we compare image- and text-based retrieval and question answering systems. Text retrieval using Arctic 2.0 embeddings, BM25, and hybrid text search achieves 46% Recall@1, 78% Recall@5, and 91% Recall@20, while image-based retrieval reaches 43%, 78%, and 93%, respectively. The two modalities exhibit complementary failures, enabling multimodal hybri
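The Recall@k figures quoted in the abstract can be illustrated with a minimal sketch. It assumes each needle-in-the-haystack question has exactly one gold page, which the abstract does not state explicitly. The function name `recall_at_k`, the page identifiers, and the toy rankings are all hypothetical and are not taken from the benchmark.

```python
from typing import Dict, List


def recall_at_k(rankings: Dict[str, List[str]], gold: Dict[str, str], k: int) -> float:
    """Fraction of questions whose gold page appears in the top-k retrieved pages.

    Assumption: one gold page per question (needle-in-the-haystack setting).
    """
    if not gold:
        raise ValueError("gold must contain at least one question")
    hits = sum(
        1
        for qid, page in gold.items()
        if page in rankings.get(qid, [])[:k]  # a question with no ranking counts as a miss
    )
    return hits / len(gold)


# Hypothetical toy data: four questions, each with a ranked list of page IDs.
gold = {"q1": "paperA_p3", "q2": "paperB_p7", "q3": "paperC_p1", "q4": "paperD_p2"}
rankings = {
    "q1": ["paperA_p3", "paperA_p4"],
    "q2": ["paperB_p1", "paperB_p7"],
    "q3": ["paperC_p9", "paperC_p8", "paperC_p1"],
    "q4": ["paperD_p5"],
}

for k in (1, 2, 3):
    print(f"Recall@{k} = {recall_at_k(rankings, gold, k):.2f}")
```

Under the single-gold-page assumption, Recall@k can only stay flat or rise as k grows, which matches the pattern of the reported numbers from Recall@1 to Recall@20.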
Related papers
- VisR-Bench: An Empirical Study On Visual Retrieval-augmented Generation For Multilingual Long Document Understanding (2025)
- MIRACL-VISION: A Large, Multilingual, Visual Document Retrieval Benchmark (2025)
- SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval (2024)
- Visual-RAG: Benchmarking Text-to-image Retrieval Augmented Generation For Visual Knowledge Intensive Queries (2025)
- Document Haystacks: Vision-language Reasoning Over Piles Of 1000+ Documents (2024)
- Document-as-image Representations Fall Short For Scientific Retrieval (2026)
- Enhancing Document VQA Models Via Retrieval-augmented Generation (2025)
- ModernVBERT: Towards Smaller Visual Document Retrievers (2025)