Multimodal Representation Learning Conditioned On Semantic Relations
2025 · Yang Qiao, Yuntong Hu, Liang Zhao
Abstract
Multimodal representation learning has advanced rapidly with contrastive models such as CLIP, which align image-text pairs in a shared embedding space. However, these models face three limitations: (1) they typically focus on image-text pairs, underutilizing the semantic relations across different pairs; (2) they directly match global embeddings without contextualization, overlooking the need for semantic alignment along specific subspaces or relational dimensions; and (3) they emphasize cross-modal contrast, with limited support for intra-modal consistency. To address these issues, we propose Relation-Conditioned Multimodal Learning (RCML), a framework that learns multimodal representations under natural-language relation descriptions to guide both feature extraction and alignment. Our approach constructs many-to-many training pairs linked by semantic relations and introduces a relation-guided cross-attention mechanism that modulates multimodal representations under each relation context. The
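To make the idea of relation-guided cross-attention concrete, here is a minimal sketch in plain Python. It is not the RCML architecture, which the abstract does not specify; it only illustrates the generic pattern of using a relation embedding as the attention query over a set of feature tokens, so that the same tokens pool into different representations under different relations. The function name `relation_guided_pool`, the toy vectors, and the absence of learned projections are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def relation_guided_pool(relation, tokens):
    """Pool feature tokens with attention weights driven by a relation vector.

    relation: list of d floats (hypothetical embedding of a relation description)
    tokens:   list of n vectors, each d floats (hypothetical image or text tokens)
    Returns (pooled vector of d floats, attention weights of n floats).
    Assumption: identity query/key/value projections; a real model would learn them.
    """
    scale = math.sqrt(len(relation))
    weights = softmax([dot(relation, t) / scale for t in tokens])
    dim = len(tokens[0])
    pooled = [sum(w * t[i] for w, t in zip(weights, tokens)) for i in range(dim)]
    return pooled, weights


# Two toy tokens and two toy relations that each favour a different token.
tokens = [[1.0, 0.0], [0.0, 1.0]]
relation_a = [4.0, 0.0]
relation_b = [0.0, 4.0]

pooled_a, weights_a = relation_guided_pool(relation_a, tokens)
pooled_b, weights_b = relation_guided_pool(relation_b, tokens)
```

In this toy setup the two relations attend to different tokens, so `pooled_a` and `pooled_b` differ even though the input tokens are identical, which is the sense in which a relation context can modulate a representation.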
Related papers
- Linking Representations With Multimodal Contrastive Learning (2023) 0.00
- Breaking The Modality Barrier: Universal Embedding Learning With Multimodal LLMs (2025) 4.52
- Multimodal Representation Alignment For Cross-modal Information Retrieval (2025) 0.00
- Multimodal Contrastive Training For Visual Representation Learning (2021) 16.32
- Explaining And Mitigating The Modality Gap In Contrastive Multimodal Learning (2024) 0.00
- Generalized Contrastive Learning For Universal Multimodal Retrieval (2025) 0.00
- Using Multiple Instance Learning To Build Multimodal Representations (2022) 4.52
- Guiding Cross-modal Representations With MLLM Priors Via Preference Alignment (2025) 0.00