Benchmarking Vision-language Contrastive Methods For Medical Representation Learning
2024 · Shuvendu Roy, Yasaman Parhizkar, Franklin Ogidi, et al.
Abstract
We comprehensively benchmark contrastive frameworks for learning multimodal representations in the medical domain. Through this study, we aim to answer the following research questions: (i) How transferable are general-domain representations to the medical domain? (ii) Is multimodal contrastive training sufficient, or does it benefit from unimodal training as well? (iii) What is the impact of feature granularity on the effectiveness of multimodal medical representation learning? To answer these questions, we investigate eight contrastive learning approaches under identical training setups, train them on 2.8 million image-text pairs from four datasets, and evaluate them on 25 downstream tasks, including classification (zero-shot and linear probing), image-to-text and text-to-image retrieval, and visual question answering. Our findings suggest a positive answer to the first question, a negative answer to the second question, and the benefit of learning fine-grained feat
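To make the kind of objective benchmarked here concrete, the following is a minimal sketch of a generic CLIP-style symmetric image-text contrastive loss, plus the nearest-neighbor lookup that underlies image-to-text retrieval. It is not taken from the paper: the abstract does not name the eight methods or their losses, so the function names, the temperature of 0.07, and the toy embeddings are all assumptions made for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Generic CLIP-style loss: pair i of images and texts is the positive,
    every other pairing in the batch is a negative.
    The temperature value is an illustrative assumption."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = cosine similarity of image i and text j, scaled by temperature
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Image-to-text direction: each row should put its mass on the diagonal.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: same, over columns.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (i2t + t2i)


def retrieve_text(image_emb, text_embs):
    """Image-to-text retrieval: index of the text with highest cosine similarity."""
    q = l2_normalize(image_emb)
    sims = [sum(a * b for a, b in zip(q, l2_normalize(t))) for t in text_embs]
    return max(range(len(sims)), key=sims.__getitem__)


# Toy 2-D embeddings (hypothetical): aligned pairs vs. deliberately swapped pairs.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = symmetric_contrastive_loss(images, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = symmetric_contrastive_loss(images, [[0.0, 1.0], [1.0, 0.0]])
best_match = retrieve_text([1.0, 0.1], [[0.0, 1.0], [1.0, 0.0]])
```

In this sketch the loss is small when matching image and text embeddings point the same way and large when the pairs are swapped. Zero-shot classification can be read as the same lookup with class-name prompts in place of captions.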
Related papers
- Benchmarking Robustness Of Contrastive Learning Models For Medical Image-report Retrieval (2025)
- Efficient Medical Vision-language Alignment Through Adapting Masked Vision Models (2025)
- Vision-language Modelling For Radiological Imaging And Reports In The Low Data Regime (2023)
- Masked Contrastive Reconstruction For Cross-modal Medical Image-report Retrieval (2023)
- MedCLIP: Contrastive Learning From Unpaired Medical Images And Text (2022)
- Multimodal Contrastive Training For Visual Representation Learning (2021)
- Multi-task Cross-modal Learning For Chest X-ray Image Retrieval (2026)
- LVLM-aware Multimodal Retrieval For RAG-based Medical Diagnosis With General-purpose Models (2025)