Exploring A Unified Vision-centric Contrastive Alternatives On Multi-modal Web Documents
2025 · Yiqi Lin, Alex Jinpeng Wang, Linjie Li, et al.
Abstract
Contrastive vision-language models such as CLIP have demonstrated strong performance across a wide range of multimodal tasks by learning from aligned image-text pairs. However, their ability to handle complex, real-world web documents remains limited, particularly in scenarios where text and images are interleaved, loosely aligned, or embedded in visual form. To address these challenges, we propose Vision-Centric Contrastive Learning (VC2L), a unified framework that models text, images, and their combinations using a single vision transformer. VC2L operates entirely in pixel space by rendering all inputs, whether textual, visual, or combined, as images, thus eliminating the need for OCR, text tokenization, or modality fusion strategy. To capture complex cross-modal relationships in multimodal web documents, VC2L employs a snippet-level contrastive learning objective that aligns consecutive multimodal segments, leveraging the inherent coherence of documents without requiring explicitly
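A minimal sketch of what a snippet-level contrastive objective could look like, assuming a symmetric InfoNCE loss over a batch of consecutive snippet pairs. The abstract does not specify the loss form, so the symmetric formulation, the temperature value, and every name here (`snippet_contrastive_loss`, `left`, `right`) are illustrative assumptions, not the authors' implementation. The embeddings stand in for the output of the single vision transformer applied to rendered snippets; the rendering and encoding steps are not shown.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax_row(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def snippet_contrastive_loss(left, right, temperature=0.07):
    """Symmetric InfoNCE over consecutive snippet pairs (assumed form).

    left[i] and right[i] are embeddings of two consecutive snippets from
    the same document; every other pairing in the batch is a negative.
    """
    assert len(left) == len(right) and len(left) > 0
    left = [l2_normalize(v) for v in left]
    right = [l2_normalize(v) for v in right]
    n = len(left)

    # Cosine similarity matrix, scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(left[i], right[j])) / temperature
         for j in range(n)]
        for i in range(n)
    ]

    # Direction 1: each left snippet should pick out its own successor.
    loss_l2r = -sum(log_softmax_row(logits[i])[i] for i in range(n)) / n

    # Direction 2: each right snippet should pick out its own predecessor.
    logits_t = [list(col) for col in zip(*logits)]
    loss_r2l = -sum(log_softmax_row(logits_t[j])[j] for j in range(n)) / n

    return 0.5 * (loss_l2r + loss_r2l)


# Toy batch: two documents, one consecutive snippet pair each.
aligned = snippet_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                   [[1.0, 0.0], [0.0, 1.0]])
swapped = snippet_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                   [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch, `aligned` is close to zero because each snippet is most similar to its true neighbor, while `swapped` is large because the neighbors have been exchanged across documents.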
Related papers
- Advancing Myopia To Holism: Fully Contrastive Language-image Pre-training (2024) · 0.00
- Contrasting Intra-modal And Ranking Cross-modal Hard Negatives To Enhance Visio-linguistic Compositional Understanding (2023) · 12.11
- Linking Representations With Multimodal Contrastive Learning (2023) · 0.00
- COTS: Collaborative Two-stream Vision-language Pre-training Model For Cross-modal Retrieval (2022) · 13.60
- Generalized Contrastive Learning For Universal Multimodal Retrieval (2025) · 0.00
- HiMo-CLIP: Modeling Semantic Hierarchy And Monotonicity In Vision-language Alignment (2025) · 3.01
- Come-vl: Scaling Complementary Multi-encoder Vision-language Learning (2026) · 0.00
- LightCLIP: Learning Multi-level Interaction For Lightweight Vision-language Models (2023) · 0.00