Using Multiple Instance Learning To Build Multimodal Representations
2022 · Peiqi Wang, William M. Wells, Seth Berkowitz, et al.
Abstract
Image-text multimodal representation learning aligns data across modalities and enables important medical applications, e.g., image classification, visual grounding, and cross-modal retrieval. In this work, we establish a connection between multimodal representation learning and multiple instance learning. Based on this connection, we propose a generic framework for constructing permutation-invariant score functions with many existing multimodal representation learning approaches as special cases. Furthermore, we use the framework to derive a novel contrastive learning approach and demonstrate that our method achieves state-of-the-art results in several downstream tasks.
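To make the idea of a permutation-invariant score function concrete, here is a minimal sketch in plain Python. It is not the authors' method: the abstract does not specify the score function or the loss, so the log-sum-exp pooling, the InfoNCE-style objective, and every name below (`bag_score`, `contrastive_loss`, `temperature`) are assumptions chosen for illustration. An image is treated as a bag of region embeddings (the instances), and the score against a text embedding does not depend on the order of the regions.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def bag_score(regions: List[Vector], text: Vector, temperature: float = 1.0) -> float:
    """Score a non-empty bag of region embeddings against one text embedding.

    Assumption: instance-level similarities are pooled with log-sum-exp,
    a smooth maximum. Pooling by sum or max would also be order-independent.
    """
    sims = [dot(r, text) / temperature for r in regions]
    m = max(sims)  # subtract the max for numerical stability
    return m + math.log(sum(math.exp(s - m) for s in sims))


def contrastive_loss(bags: List[List[Vector]], texts: List[Vector],
                     temperature: float = 1.0) -> float:
    """InfoNCE-style image-to-text loss (assumed form, not from the paper).

    Bag i is paired with text i; all other texts in the batch act as negatives.
    """
    n = len(bags)
    total = 0.0
    for i in range(n):
        scores = [bag_score(bags[i], texts[j], temperature) for j in range(n)]
        m = max(scores)
        log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_norm - scores[i]  # negative log-softmax of the matched pair
    return total / n
```

Shuffling the regions inside a bag leaves `bag_score` unchanged up to floating-point rounding, which is the property the framework in the abstract is built around. Swapping the pooling step for another order-independent aggregator gives a different score function with the same invariance.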
Related papers
- Multimodal Contrastive Training For Visual Representation Learning (2021)
- Multimodal Representation Learning Conditioned On Semantic Relations (2025)
- Benchmarking Vision-language Contrastive Methods For Medical Representation Learning (2024)
- A Mathematical Perspective On Contrastive Learning (2025)
- Breaking The Modality Barrier: Universal Embedding Learning With Multimodal LLMs (2025)
- Multimodal Representation Alignment For Cross-modal Information Retrieval (2025)
- Explaining And Mitigating The Modality Gap In Contrastive Multimodal Learning (2024)
- MXM-CLR: A Unified Framework For Contrastive Learning Of Multifold Cross-modal Representations (2023)