GlobalDoc: A Cross-modal Vision-language Framework For Real-world Document Image Retrieval And Classification
2023 · Souhail Bakkali, Sanket Biswas, Zuheng Ming, et al.
Abstract
Visual document understanding (VDU) has rapidly advanced with the development of powerful multi-modal language models. However, these models typically require extensive document pre-training data to learn intermediate representations and often suffer a significant performance drop in real-world online industrial settings. A primary issue is their heavy reliance on OCR engines to extract local positional information within document pages, which limits the models' ability to capture global information and hinders their generalizability, flexibility, and robustness. In this paper, we introduce GlobalDoc, a cross-modal transformer-based architecture pre-trained in a self-supervised manner using three novel pretext objective tasks. GlobalDoc improves the learning of richer semantic concepts by unifying language and visual representations, resulting in more transferable models. For proper evaluation, we also propose two novel document-level downstream VDU tasks, Few-Shot Document Image Class
Related papers
- ModernVBERT: Towards Smaller Visual Document Retrievers (2025)
- ColPali: Efficient Document Retrieval With Vision Language Models (2024)
- SERVAL: Surprisingly Effective Zero-shot Visual Document Retrieval Powered By Large Vision And Language Models (2025)
- Unlocking Multimodal Document Intelligence: From Current Triumphs To Future Frontiers Of Visual Document Retrieval (2026)
- VLDeformer: Vision-language Decomposed Transformer For Fast Cross-modal Retrieval (2021)
- M3DR: Towards Universal Multilingual Multimodal Document Retrieval (2025)
- Exploring A Unified Vision-centric Contrastive Alternatives On Multi-modal Web Documents (2025)
- SimpleDoc: Multi-modal Document Understanding With Dual-cue Page Retrieval And Iterative Refinement (2025)