CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning
2026 · Ankan Deria, Komal Kumar, Xilin He, et al.
Abstract
Recent vision-language models (VLMs) typically rely on a single vision encoder trained with contrastive image-text objectives, such as CLIP-style pretraining. While contrastive encoders are effective for cross-modal alignment and retrieval, self-supervised visual encoders often capture richer dense semantics and exhibit stronger robustness on recognition and understanding tasks. In this work, we investigate how to scale the fusion of these complementary visual representations for vision-language modeling. We propose CoME-VL: Complementary Multi-Encoder Vision-Language, a modular fusion framework that integrates a contrastively trained vision encoder with a self-supervised DINO encoder. Our approach performs representation-level fusion by (i) entropy-guided multi-layer aggregation with orthogonality-constrained projections to reduce redundancy, and (ii) RoPE-enhanced cross-attention to align heterogeneous token grids and produce compact fused visual tokens. The fused tokens can be injec
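The abstract names two ingredients, an entropy signal that guides multi-layer aggregation and an orthogonality constraint on the projections, without giving formulas. The sketch below is only an illustration under stated assumptions, not the authors' implementation: `orthogonality_penalty` uses the standard squared Frobenius form of such a constraint, `row_entropy` is plain Shannon entropy, and `aggregate_layers` with its softmax weighting over per-layer scores is a hypothetical choice made here for concreteness. All function names are invented for this example.

```python
import numpy as np


def orthogonality_penalty(W):
    """Squared Frobenius norm of W W^T - I.

    Zero exactly when the rows of W are orthonormal. This is a common
    way to write an orthogonality constraint, assumed here for illustration.
    """
    gram = W @ W.T
    return float(np.sum((gram - np.eye(W.shape[0])) ** 2))


def row_entropy(probs, eps=1e-12):
    """Shannon entropy (in nats) of each row of a row-stochastic matrix."""
    return -np.sum(probs * np.log(probs + eps), axis=-1)


def aggregate_layers(layer_feats, layer_scores):
    """Hypothetical aggregation: softmax-weighted sum of per-layer features.

    layer_feats  : list of L arrays, each of shape (N, D)
    layer_scores : L scalars (for example, entropy-derived scores)
    Returns the fused (N, D) array and the L weights used.
    """
    scores = np.asarray(layer_scores, dtype=float)
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    stacked = np.stack(layer_feats, axis=0)  # shape (L, N, D)
    fused = np.tensordot(weights, stacked, axes=1)  # shape (N, D)
    return fused, weights
```

In a training loop, a penalty of this kind would typically be added to the task loss with a small coefficient, so that projections of different layers are pushed toward non-redundant directions. Whether CoME-VL weights layers by high or low entropy is not stated in the abstract, so the sign and source of `layer_scores` are left to the caller.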