ITO: Images And Texts As One Via Synergizing Multiple Alignment And Training-time Fusion
2026 · Hanpeng Liu, Yaqian Li, Zidan Wang, et al.
Abstract
Image-text contrastive pretraining has become a dominant paradigm for visual representation learning, yet existing methods often yield representations that remain partially organized by modality. We propose ITO, a framework addressing this limitation through two synergistic mechanisms. Multimodal multiple alignment enriches supervision by mining diverse image-text correspondences, while a lightweight training-time multimodal fusion module enforces structured cross-modal interaction. Crucially, the fusion module is discarded at inference, preserving the efficiency of standard dual-encoder architectures. Extensive experiments show that ITO consistently outperforms strong baselines across classification, retrieval, and multimodal benchmarks. Our analysis reveals that while multiple alignment drives discriminative power, training-time fusion acts as a critical structural regularizer -- eliminating the modality gap and stabilizing training dynamics to prevent the early saturation often obse
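For readers unfamiliar with the terms in the abstract, here is a minimal sketch of the standard dual-encoder contrastive objective and of one common way to quantify a modality gap. It is not ITO's method: the abstract does not specify the multiple-alignment mining or the fusion module, so neither is shown. The function names, the temperature value of 0.07, and the centroid-distance definition of the gap are all assumptions made for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss for a batch where image i is paired with text i.

    Illustrative only: temperature=0.07 is an assumed default, not a value
    taken from the paper.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # Cosine-similarity logits: rows index images, columns index texts.
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
         for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_diag(rows):
        # Mean cross-entropy with the diagonal entry as the correct class,
        # using a numerically stable log-sum-exp.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    text_to_image = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(text_to_image))


def modality_gap(image_embs, text_embs):
    """Euclidean distance between the centroids of the normalized image and
    text embeddings (an assumed definition of the gap)."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    dim = len(imgs[0])
    img_center = [sum(v[d] for v in imgs) / len(imgs) for d in range(dim)]
    txt_center = [sum(v[d] for v in txts) / len(txts) for d in range(dim)]
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(img_center, txt_center)))


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    texts = [[1.0, 0.0], [0.0, 1.0]]
    print(symmetric_contrastive_loss(images, texts))
    print(modality_gap(images, texts))
```

Under these assumptions, correctly paired embeddings yield a loss near zero and a gap of zero, while swapping the text order raises the loss sharply. At inference a dual encoder only needs the two embedding functions, which is why a module used solely during training adds no deployment cost.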
Related papers
- FuseLIP: Multimodal Embeddings Via Early Fusion Of Discrete Tokens (2025)
- The More, The Merrier: Contrastive Fusion For Higher-order Multimodal Alignment (2025)
- Multimodal Contrastive Training For Visual Representation Learning (2021)
- CODER: Coupled Diversity-sensitive Momentum Contrastive Learning For Image-text Retrieval (2022)
- Matching Images And Text With Multi-modal Tensor Fusion And Re-ranking (2019)
- Curriculum Learning For Data-efficient Vision-language Alignment (2022)
- COTS: Collaborative Two-stream Vision-language Pre-training Model For Cross-modal Retrieval (2022)
- ImageBERT: Cross-modal Pre-training With Large-scale Weak-supervised Image-text Data (2020)