The More, The Merrier: Contrastive Fusion For Higher-order Multimodal Alignment
2025 · Stefanos Koutoupis, Michaela Areti Zervou, Konstantinos Kontras, et al.
Abstract
Learning joint representations across multiple modalities remains a central challenge in multimodal machine learning. Prevailing approaches predominantly operate in pairwise settings, aligning two modalities at a time. While some recent methods aim to capture higher-order interactions among multiple modalities, they often overlook or insufficiently preserve pairwise relationships, limiting their effectiveness on single-modality tasks. In this work, we introduce Contrastive Fusion (ConFu), a framework that jointly embeds both individual modalities and their fused combinations into a unified representation space, where modalities and their fused counterparts are aligned. ConFu extends traditional pairwise contrastive objectives with an additional fused-modality contrastive term, encouraging the joint embedding of modality pairs with a third modality. This formulation enables ConFu to capture higher-order dependencies, such as XOR-like relationships, that cannot be recovered through pairw
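The XOR-like relationship mentioned in the abstract can be checked on a toy example. The sketch below is illustrative only and is not taken from the paper: it assumes three stand-in "modalities" x, y, and z, where x and y are independent fair bits and z = x XOR y. The helper name `mutual_information` is hypothetical. By exact enumeration, it shows that neither x nor y alone carries any information about z, while the pair (x, y) determines z completely.

```python
from collections import Counter
from itertools import product
from math import log2


def mutual_information(pairs):
    """Exact mutual information in bits for a list of equally likely (a, b) outcomes."""
    n = len(pairs)
    joint = Counter(pairs)
    marg_a = Counter(a for a, _ in pairs)
    marg_b = Counter(b for _, b in pairs)
    return sum(
        (c / n) * log2((c / n) / ((marg_a[a] / n) * (marg_b[b] / n)))
        for (a, b), c in joint.items()
    )


# Assumption: x and y are independent fair bits, z is their XOR.
# The four outcomes below are therefore equally likely.
outcomes = [(x, y, x ^ y) for x, y in product((0, 1), repeat=2)]

# Pairwise view: each single "modality" against z.
mi_x_z = mutual_information([(x, z) for x, y, z in outcomes])
mi_y_z = mutual_information([(y, z) for x, y, z in outcomes])

# Higher-order view: the combination (x, y) against z.
mi_xy_z = mutual_information([((x, y), z) for x, y, z in outcomes])

print(f"I(x; z)    = {mi_x_z:.1f} bits")
print(f"I(y; z)    = {mi_y_z:.1f} bits")
print(f"I(x, y; z) = {mi_xy_z:.1f} bits")
```

In this toy setting, an objective that only compares x with z and y with z has nothing to align, whereas a term that compares a combined representation of (x, y) with z has a full bit of signal to work with. This matches the motivation the abstract gives for adding a fused-modality term alongside the pairwise ones.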
Related papers
- Joint Fusion And Encoding: Advancing Multimodal Retrieval From The Ground Up (2025)
- Data-efficient Multimodal Fusion On A Single GPU (2023)
- Dense Multimodal Fusion For Hierarchically Joint Representation (2018)
- Fuselip: Multimodal Embeddings Via Early Fusion Of Discrete Tokens (2025)
- Fuse After Align: Improving Face-voice Association Learning Via Multimodal Encoder (2024)
- ITO: Images And Texts As One Via Synergizing Multiple Alignment And Training-time Fusion (2026)
- Towards Uniformity And Alignment For Multimodal Representation Learning (2026)
- A Mathematical Perspective On Contrastive Learning (2025)