MedCPT: Contrastive Pre-trained Transformers With Large-scale PubMed Search Logs For Zero-shot Biomedical Information Retrieval
2023 · Qiao Jin, Won Kim, Qingyu Chen, et al.
Abstract
Information retrieval (IR) is essential in biomedical knowledge acquisition and clinical decision support. While recent progress has shown that language model encoders perform better semantic retrieval, training such models requires abundant query-article annotations that are difficult to obtain in biomedicine. As a result, most biomedical IR systems only conduct lexical matching. In response, we introduce MedCPT, a first-of-its-kind Contrastively Pre-trained Transformer model for zero-shot semantic IR in biomedicine. To train MedCPT, we collected an unprecedented scale of 255 million user click logs from PubMed. With such data, we use contrastive learning to train a pair of closely-integrated retriever and re-ranker. Experimental results show that MedCPT sets new state-of-the-art performance on six biomedical IR tasks, outperforming various baselines including much larger models such as GPT-3-sized cpt-text-XL. In addition, MedCPT also generates better biomedical article and sentence
Related papers
- MedCLIP: Contrastive Learning From Unpaired Medical Images And Text (2022)
- MoRE: Multi-modal Contrastive Pre-training With Transformers On X-rays, ECGs, And Diagnostic Report (2024)
- Multi-task Cross-modal Learning For Chest X-ray Image Retrieval (2026)
- Benchmarking Robustness Of Contrastive Learning Models For Medical Image-report Retrieval (2025)
- Pretrain Like Your Inference: Masked Tuning Improves Zero-shot Composed Image Retrieval (2023)
- Beyond Retrieval: Ensembling Cross-encoders And GPT Rerankers With LLMs For Biomedical QA (2025)
- SCOT: Self-supervised Contrastive Pretraining For Zero-shot Compositional Retrieval (2025)
- M3Ret: Unleashing Zero-shot Multimodal Medical Image Retrieval Via Self-supervision (2025)