ReAlign: Optimizing The Visual Document Retriever With Reasoning-guided Fine-grained Alignment
2026 · Hao Yang, Yifan Ji, Zhipeng Xu, et al.
Abstract
Visual document retrieval aims to retrieve a set of document pages relevant to a query from visually rich collections. Existing methods often employ Vision-Language Models (VLMs) to encode queries and visual pages into a shared embedding space, which is then optimized via contrastive training. However, during visual document representation, localized evidence is usually scattered across complex document layouts, making it difficult for retrieval models to capture crucial cues for effective embedding learning. In this paper, we propose Reasoning-Guided Alignment (ReAlign), a method that enhances visual document retrieval by leveraging the reasoning capability of VLMs to provide fine-grained visual document descriptions as supervision signals for training. Specifically, ReAlign employs a superior VLM to identify query-related regions on a page and then generates a query-aware description grounding the cropped visual regions. The retriever is then trained using these region-focused descri
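The abstract says queries and pages are encoded into a shared embedding space that is "optimized via contrastive training" but does not state the exact objective. As a rough illustration only, the sketch below uses a generic in-batch softmax contrastive loss, which is an assumption and not necessarily the loss used in ReAlign. The function name `in_batch_contrastive_loss`, the `temperature` value, and the toy vectors are all hypothetical.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def in_batch_contrastive_loss(query_embs, page_embs, temperature=0.05):
    """Generic in-batch contrastive loss over cosine similarities.

    query_embs[i] and page_embs[i] are treated as a positive pair; every
    other page in the batch serves as a negative for query i.
    This is an illustrative assumption, not the paper's stated objective.
    """
    queries = [_normalize(q) for q in query_embs]
    pages = [_normalize(p) for p in page_embs]

    total = 0.0
    for i, q in enumerate(queries):
        # Temperature-scaled cosine similarity of query i to every page.
        logits = [sum(a * b for a, b in zip(q, p)) / temperature for p in pages]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Negative log-probability assigned to the matching page.
        total += log_denom - logits[i]
    return total / len(queries)


# Toy 2-D embeddings (made up for illustration).
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_pages = [[0.9, 0.1], [0.2, 0.8]]
swapped_pages = aligned_pages[::-1]

loss_aligned = in_batch_contrastive_loss(queries, aligned_pages)
loss_swapped = in_batch_contrastive_loss(queries, swapped_pages)
```

With these toy vectors the loss is lower when each query is paired with its nearby page than when the pairs are swapped, which is the behavior a contrastive objective rewards.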
Related papers
- Attention Grounded Enhancement For Visual Document Retrieval (2025)
- Evo-retriever: LLM-guided Curriculum Evolution With Viewpoint-pathway Collaboration For Multimodal Document Retrieval (2026)
- X-aligner: Composed Visual Retrieval Without The Bells And Whistles (2026)
- Visual-text Cross Alignment: Refining The Similarity Score In Vision-language Models (2024)
- Document Optimization For Black-box Retrieval Via Reinforcement Learning (2026)
- ModernVBERT: Towards Smaller Visual Document Retrievers (2025)
- V-retrver: Evidence-driven Agentic Reasoning For Universal Multimodal Retrieval (2026)
- ColPali: Efficient Document Retrieval With Vision Language Models (2024)