Document-as-image Representations Fall Short For Scientific Retrieval
2026 · Ghazal Khalighinejad, Raghuveer Thirukovalluru, Alexander H. Oh, et al.
Abstract
Many recent document embedding models are trained on document-as-image representations, embedding rendered pages as images rather than the underlying source. Meanwhile, existing benchmarks for scientific document retrieval, such as ArXivQA and ViDoRe, treat documents as images of pages, implicitly favoring such representations. In this work, we argue that this paradigm is not well-suited for text-rich multimodal scientific documents, where critical evidence is distributed across structured sources, including text, tables, and figures. To study this setting, we introduce ArXivDoc, a new benchmark constructed from the underlying LaTeX sources of scientific papers. Unlike PDF or image-based representations, LaTeX provides direct access to structured elements (e.g., sections, tables, figures, equations), enabling controlled query construction grounded in specific evidence types. We systematically compare text-only, image-based, and multimodal representations across both single-vector and m
Related papers
- IRPAPERS: A Visual Document Benchmark For Scientific Retrieval And Question Answering (2026)
- DocMMIR: A Framework For Document Multi-modal Information Retrieval (2025)
- ColPali: Efficient Document Retrieval With Vision Language Models (2024)
- ModernVBERT: Towards Smaller Visual Document Retrievers (2025)
- On The Representational Limits Of Quantum-inspired 1024-D Document Embeddings: An Experimental Evaluation Framework (2026)
- Document Haystacks: Vision-language Reasoning Over Piles Of 1000+ Documents (2024)
- VisR-Bench: An Empirical Study On Visual Retrieval-augmented Generation For Multilingual Long Document Understanding (2025)
- SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval (2024)