Learning To Read Where To Look: Disease-aware Vision-language Pretraining For 3D CT
2026 · Simon Ging, Philipp Arnold, Sebastian Walter, et al.
Abstract
Recent 3D CT vision-language models align volumes with reports via contrastive pretraining, but typically rely on limited public data and provide only coarse global supervision. We train a 3D CT vision-language model on 98k report-volume pairs (50k patients) collected at a single hospital, combined with public datasets, using SigLIP-style contrastive pretraining together with prompt-based disease supervision in the shared vision-text embedding space. On CT-RATE, our model achieves state-of-the-art text-to-image retrieval (R@10 31.5 vs. 22.2) and competitive disease classification (AUC 83.8 vs. 83.8), with consistent results on Rad-ChestCT (AUC 77.0 vs. 77.3). We further observe that radiologists routinely reference specific images within their reports (e.g., "series X, image Y"), linking textual descriptions to precise axial locations. We automatically mine 262k such snippet-slice pairs and introduce the task of intra-scan snippet localization -- predicting the axial depth referred t
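The mining step mentioned in the abstract can be illustrated with a minimal sketch. It assumes reports contain references literally phrased as "series X, image Y", as in the abstract's example. The regular expression, the naive sentence splitter, and the helper names `mine_slice_references` and `normalized_depth` are hypothetical and are not taken from the paper. Mapping the image index to a relative depth in [0, 1] is likewise an assumption about how an axial-depth target could be encoded.

```python
import re

# Hypothetical pattern: the abstract only gives "series X, image Y" as an example,
# so real reports may well need additional variants.
REF_PATTERN = re.compile(
    r"series\s+(?P<series>\d+)\s*,\s*image\s+(?P<image>\d+)",
    re.IGNORECASE,
)


def mine_slice_references(report: str) -> list[dict]:
    """Return one record per 'series X, image Y' mention, keeping the
    enclosing sentence as the text snippet."""
    records = []
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", report.strip()):
        for match in REF_PATTERN.finditer(sentence):
            records.append(
                {
                    "snippet": sentence,
                    "series": int(match["series"]),
                    "image": int(match["image"]),
                }
            )
    return records


def normalized_depth(image_index: int, num_slices: int) -> float:
    """Map a 1-based image index to a relative axial depth in [0, 1]
    (assumed target encoding, not confirmed by the source)."""
    if num_slices < 2:
        return 0.0
    return (image_index - 1) / (num_slices - 1)


if __name__ == "__main__":
    demo = (
        "Nodule in the right upper lobe (series 3, image 45). "
        "No effusion. Liver lesion, Series 5, image 12."
    )
    for record in mine_slice_references(demo):
        print(record)
```

Each record pairs a report snippet with the series and image it points to, which is the kind of snippet-slice pair the abstract describes. A real pipeline would also need to resolve the series number to a specific volume.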
Related papers
- Vision-language Modelling For Radiological Imaging And Reports In The Low Data Regime (2023) 0.00
- Beyond The Embedding Bottleneck: Adaptive Retrieval-augmented 3D CT Report Generation (2026) 0.00
- Exploring The Capabilities Of LLM Encoders For Image-text Retrieval In Chest X-rays (2025) 0.00
- Med3DVLM: An Efficient Vision-language Model For 3D Medical Image Analysis (2025) 12.60
- SeLIP: Similarity Enhanced Contrastive Language Image Pretraining For Multi-modal Head MRI (2025) 3.58
- BIMCV-R: A Landmark Dataset For 3D CT Text-image Retrieval (2024) 8.09
- On The Importance Of Text Preprocessing For Multimodal Representation Learning And Pathology Report Generation (2025) 0.00
- Multi-level CLS Token Fusion For Contrastive Learning In Endoscopy Image Classification (2025) 0.00