MURE: Hierarchical Multi-resolution Encoding Via Vision-language Models For Visual Document Retrieval
2026 · Fengbin Zhu, Zijing Cai, Yuzhe Wang, et al.
Abstract
Visual Document Retrieval (VDR) requires representations that capture both fine-grained visual details and global document structure to ensure retrieval efficacy while maintaining computational efficiency. Existing VDR models struggle to balance effectiveness and efficiency when processing high-resolution documents: they often either lose fine-grained information or generate an excessive number of visual tokens, resulting in significant indexing overhead and high retrieval latency. In this work, we rethink the visual encoding mechanism and propose a new X-VisEmb paradigm that progresses from multi-resolution sampling and encoding, through cross-granularity feature fusion, to adaptive representation distillation. A preliminary study validates its feasibility and effectiveness in capturing complementary visual cues at varying scales. Building on the insights, we develop MURE, a novel framework that employs VLMs as a hierarchical multi-resolution encoder, integrates resolution-level Matry
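The three stages named in the abstract (multi-resolution sampling and encoding, cross-granularity fusion, and distillation into a compact representation) can be illustrated with a toy sketch. This is not the MURE implementation. Every function name, the `[mean, max]` patch "encoder" standing in for a VLM, fusion by plain concatenation, and the mean-pooling compressor are assumptions chosen only to make the data flow concrete.

```python
from typing import List, Sequence

Image = List[List[float]]  # toy grayscale page: rows of pixel intensities
Token = List[float]        # toy embedding vector


def downsample(img: Image, factor: int) -> Image:
    """Average-pool by an integer factor (assumes both sides are divisible by it)."""
    h, w = len(img), len(img[0])
    return [
        [
            sum(img[r + dr][c + dc] for dr in range(factor) for dc in range(factor))
            / (factor * factor)
            for c in range(0, w, factor)
        ]
        for r in range(0, h, factor)
    ]


def encode_patches(img: Image, patch: int) -> List[Token]:
    """Hypothetical stand-in for a VLM encoder: one [mean, max] token per patch."""
    tokens = []
    for r in range(0, len(img), patch):
        for c in range(0, len(img[0]), patch):
            vals = [img[r + dr][c + dc] for dr in range(patch) for dc in range(patch)]
            tokens.append([sum(vals) / len(vals), max(vals)])
    return tokens


def compress(tokens: List[Token], budget: int) -> List[Token]:
    """Mean-pool consecutive tokens so that at most `budget` vectors remain."""
    if len(tokens) <= budget:
        return tokens
    out = []
    for i in range(budget):
        start = i * len(tokens) // budget
        end = (i + 1) * len(tokens) // budget
        group = tokens[start:end]
        dim = len(group[0])
        out.append([sum(t[d] for t in group) / len(group) for d in range(dim)])
    return out


def encode_page(
    img: Image,
    factors: Sequence[int] = (1, 2, 4),
    patch: int = 2,
    budget: int = 4,
) -> List[Token]:
    """Sample several resolutions, encode each, fuse, then compress."""
    fused: List[Token] = []
    for f in factors:
        view = downsample(img, f)                  # multi-resolution sampling
        fused.extend(encode_patches(view, patch))  # fusion by concatenation (assumption)
    return compress(fused, budget)                 # fixed token budget for indexing


# Usage: an 8x8 synthetic "page" yields 16 + 4 + 1 tokens before compression.
page = [[float((r * 8 + c) % 7) for c in range(8)] for r in range(8)]
embedding = encode_page(page)
```

The point of the sketch is the trade-off the abstract describes: the full-resolution view dominates the token count, so a fixed `budget` caps index size however many resolutions are sampled.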
Related papers
- Unlocking Multimodal Document Intelligence: From Current Triumphs To Future Frontiers Of Visual Document Retrieval (2026)
- Sculpting The Vector Space: Towards Efficient Multi-vector Visual Document Retrieval Via Prune-then-merge Framework (2026)
- SERVAL: Surprisingly Effective Zero-shot Visual Document Retrieval Powered By Large Vision And Language Models (2025)
- Nanovdr: Distilling A 2B Vision-language Retriever Into A 70M Text-only Encoder For Visual Document Retrieval (2026)
- Modernvbert: Towards Smaller Visual Document Retrievers (2025)
- Docpruner: A Storage-efficient Framework For Multi-vector Visual Document Retrieval Via Adaptive Patch-level Embedding Pruning (2025)
- Evo-retriever: LLM-guided Curriculum Evolution With Viewpoint-pathway Collaboration For Multimodal Document Retrieval (2026)
- M3DR: Towards Universal Multilingual Multimodal Document Retrieval (2025)