A Mathematical Perspective On Contrastive Learning
2025 · Ricardo Baptista, Andrew M. Stuart, Son Tran
Abstract
Multimodal contrastive learning is a methodology for linking different data modalities; the canonical example is linking image and text data. The methodology is typically framed as the identification of a set of encoders, one for each modality, that align representations within a common latent space. In this work, we focus on the bimodal setting and interpret contrastive learning as the optimization of (parameterized) encoders that define conditional probability distributions, for each modality conditioned on the other, consistent with the available data. This provides a framework for multimodal algorithms such as crossmodal retrieval, which identifies the mode of one of these conditional distributions, and crossmodal classification, which is similar to retrieval but includes a fine-tuning step to make it task specific. The framework we adopt also gives rise to crossmodal generative models. This probabilistic perspective suggests two natural generalizations of contrastive learning: t
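To make the probabilistic reading above concrete, here is a minimal sketch of crossmodal retrieval as picking the mode of a conditional distribution. It assumes that the conditional distribution over a finite set of candidates is a softmax of temperature-scaled cosine similarities between embeddings. That form is an assumption chosen for illustration and is not stated in the abstract. The function names, the temperature value, and the toy embeddings are all hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit Euclidean norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def conditional_probs(query, candidates, temperature=0.1):
    """Discrete conditional distribution over candidates given the query.

    Assumed form (not taken from the paper): softmax of cosine
    similarities divided by a temperature.
    """
    q = normalize(query)
    logits = [
        sum(a * b for a, b in zip(q, normalize(c))) / temperature
        for c in candidates
    ]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]

def retrieve(query, candidates, temperature=0.1):
    """Crossmodal retrieval: index of the mode of the conditional distribution."""
    probs = conditional_probs(query, candidates, temperature)
    return max(range(len(probs)), key=probs.__getitem__)

# Toy vectors standing in for encoder outputs (hypothetical values).
image_emb = [1.0, 0.0]
text_embs = [[0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]

probs = conditional_probs(image_emb, text_embs)
best = retrieve(image_emb, text_embs)
```

In this toy setup the second text embedding is closest in direction to the image embedding, so it receives the largest conditional probability and is the one retrieved.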
Related papers
- Explaining And Mitigating The Modality Gap In Contrastive Multimodal Learning (2024)
- Contrastive Learning Of Visual-semantic Embeddings (2021)
- Multimodal Contrastive Training For Visual Representation Learning (2021)
- Generalized Contrastive Learning For Universal Multimodal Retrieval (2025)
- Multimodal Representation Learning Conditioned On Semantic Relations (2025)
- Linking Representations With Multimodal Contrastive Learning (2023)
- Using Multiple Instance Learning To Build Multimodal Representations (2022)
- CrossCLR: Cross-modal Contrastive Learning For Multi-modal Video Representations (2021)