MXM-CLR: A Unified Framework For Contrastive Learning Of Multifold Cross-modal Representations
2023 · Ye Wang, Bowei Jiang, Changqing Zou, et al.
Abstract
Multifold observations are common for different data modalities, e.g., a 3D shape can be represented by multi-view images and an image can be described with different captions. Existing cross-modal contrastive representation learning (XM-CLR) methods such as CLIP are not fully suitable for multifold data because they consider only one positive pair and treat all other pairs as negatives when computing the contrastive loss. In this paper, we propose MXM-CLR, a unified framework for contrastive learning of multifold cross-modal representations. MXM-CLR explicitly models and learns the relationships between multifold observations of instances from different modalities for more comprehensive representation learning. The key to MXM-CLR is a novel multifold-aware hybrid loss that considers multiple positive observations when computing the hard and soft relationships for the cross-modal data pairs. We conduct quantitative and qualitative comparisons with SOTA baselines for cross-modal retrieval tasks
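The limitation the abstract describes can be made concrete with a small sketch. The code below is not the paper's multifold-aware hybrid loss, which the abstract does not specify. It only contrasts a CLIP-style loss with a single positive per row against a generic multi-positive variant in which every observation of the same instance counts as a positive. The function names, the instance-id arrays, the temperature of 0.07, and the choice to compute only one retrieval direction are all assumptions made for illustration.

```python
import numpy as np

def log_softmax(logits, axis=-1):
    # Numerically stable log-softmax.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def single_positive_loss(sim, temperature=0.07):
    # CLIP-style (one direction only): row i's sole positive is column i;
    # every other column is a negative, even if it shows the same instance.
    logp = log_softmax(sim / temperature, axis=1)
    return -np.mean(np.diag(logp))

def multi_positive_loss(sim, row_ids, col_ids, temperature=0.07):
    # Hypothetical multi-positive variant: every column that shares the
    # row's instance id is a positive; log-probabilities are averaged
    # over those positives. Assumes each row has at least one positive.
    pos = row_ids[:, None] == col_ids[None, :]
    logp = log_softmax(sim / temperature, axis=1)
    per_row = -(logp * pos).sum(axis=1) / pos.sum(axis=1)
    return per_row.mean()

# Toy batch: 2 instances, each with 2 images (rows) and 2 captions (columns).
rng = np.random.default_rng(0)
sim = rng.normal(size=(4, 4))          # stand-in for image-text similarities
instance_ids = np.array([0, 0, 1, 1])  # assumed per-observation instance labels
print(single_positive_loss(sim))
print(multi_positive_loss(sim, instance_ids, instance_ids))
```

When every instance has exactly one observation per modality, the positive mask reduces to the identity and the two functions return the same value. They differ only when a batch contains several observations of the same instance.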
Related papers
- CrossCLR: Cross-modal Contrastive Learning For Multi-modal Video Representations (2021)
- Muco: Multi-turn Contrastive Learning For Multimodal Embedding Model (2026)
- Generalized Contrastive Learning For Universal Multimodal Retrieval (2025)
- Multi-task Cross-modal Learning For Chest X-ray Image Retrieval (2026)
- Breaking The Modality Barrier: Universal Embedding Learning With Multimodal LLMs (2025)
- Multimodal Representation Learning Conditioned On Semantic Relations (2025)
- Multimodal Contrastive Training For Visual Representation Learning (2021)
- Deep Reversible Consistency Learning For Cross-modal Retrieval (2025)