Towards Uniformity And Alignment For Multimodal Representation Learning
2026 · Wenzhe Yin, Pan Zhou, Zehao Xiao, et al.
Abstract
Multimodal representation learning aims to construct a shared embedding space in which heterogeneous modalities are semantically aligned. Despite strong empirical results, InfoNCE-based objectives introduce inherent conflicts that yield distribution gaps across modalities. In this work, we identify two conflicts in the multimodal regime, both exacerbated as the number of modalities increases: (i) an alignment-uniformity conflict, whereby the repulsion of uniformity undermines pairwise alignment, and (ii) an intra-alignment conflict, where aligning multiple modalities induces competing alignment directions. To address these issues, we propose a principled decoupling of alignment and uniformity for multimodal representations, providing a conflict-free recipe for multimodal learning that simultaneously supports discriminative and generative use cases without task-specific modules. We then provide a theoretical guarantee that our method acts as an efficient proxy for a global Hölder dive
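To make the two quantities named in the abstract concrete, the following is a minimal sketch of alignment and uniformity computed as separate terms, using the standard formulation from the contrastive-learning literature (mean squared distance between paired embeddings, and the log of the mean Gaussian potential over pairs). This is not the objective proposed in the paper, which the abstract does not specify. The toy embeddings, the function names, the temperature `t`, and the weight `0.5` are all illustrative assumptions.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def alignment_loss(za, zb):
    """Mean squared distance between paired embeddings of two modalities."""
    return sum(sq_dist(a, b) for a, b in zip(za, zb)) / len(za)


def uniformity_loss(z, t=2.0):
    """Log of the mean Gaussian potential over distinct pairs in one modality.

    More negative values mean the points are spread more evenly.
    """
    n = len(z)
    potentials = [
        math.exp(-t * sq_dist(z[i], z[j]))
        for i in range(n)
        for j in range(i + 1, n)
    ]
    return math.log(sum(potentials) / len(potentials))


# Hypothetical paired embeddings for two modalities (row i of each is a pair).
image = [normalize(v) for v in [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]]
text = [normalize(v) for v in [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]]

align = alignment_loss(image, text)
unif = uniformity_loss(image) + uniformity_loss(text)

# Assumed weighting; the two terms are computed independently and only
# combined at the end, rather than being entangled as in InfoNCE.
total = align + 0.5 * unif
print(align, unif, total)
```

Keeping the terms separate makes it possible to inspect or reweight each one on its own, which is the kind of control a single InfoNCE loss does not expose.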
Related papers
- Multimodal Representation Alignment For Cross-modal Information Retrieval (2025)
- Geodesic Multi-modal Mixup For Robust Fine-tuning (2022)
- Modest-align: Data-efficient Alignment For Vision-language Models (2025)
- Modality Curation: Building Universal Embeddings For Advanced Multimodal Information Retrieval (2025)
- With Limited Data For Multimodal Alignment, Let The STRUCTURE Guide You (2025)
- Learning Unseen Modality Interaction (2023)
- The More, The Merrier: Contrastive Fusion For Higher-order Multimodal Alignment (2025)
- A TRIANGLE Enables Multimodal Alignment Beyond Cosine Similarity (2025)