Med3DVLM: An Efficient Vision-language Model For 3D Medical Image Analysis
2025 · Yu Xin, Gorkem Can Ates, Kuang Gong, et al.
Abstract
Vision-language models (VLMs) have shown promise in 2D medical image analysis, but extending them to 3D remains challenging due to the high computational demands of volumetric data and the difficulty of aligning 3D spatial features with clinical text. We present Med3DVLM, a 3D VLM designed to address these challenges through three key innovations: (1) DCFormer, an efficient encoder that uses decomposed 3D convolutions to capture fine-grained spatial features at scale; (2) SigLIP, a contrastive learning strategy with pairwise sigmoid loss that improves image-text alignment without relying on large negative batches; and (3) a dual-stream MLP-Mixer projector that fuses low- and high-level image features with text embeddings for richer multi-modal representations. We evaluate our model on the M3D dataset, which includes radiology reports and VQA data for 120,084 3D medical images. Results show that Med3DVLM achieves superior performance across multiple benchmarks. For image-text retrieval,
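The pairwise sigmoid loss mentioned in item (2) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the standard SigLIP formulation, in which every image-text pair in a batch is scored independently as a binary match or non-match, and the function names, the temperature of 10.0, and the bias of -10.0 are illustrative choices rather than values taken from this abstract.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))


def pairwise_sigmoid_loss(image_embs, text_embs, temperature=10.0, bias=-10.0):
    """Sigmoid contrastive loss over all image-text pairs in a batch.

    Assumption: temperature and bias are fixed here for illustration;
    in the SigLIP formulation they are learnable parameters.
    """
    n = len(image_embs)
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]

    total = 0.0
    for i in range(n):
        for j in range(n):
            similarity = sum(a * b for a, b in zip(images[i], texts[j]))
            logit = temperature * similarity + bias
            # +1 for the matching pair (i == j), -1 for every other pair.
            label = 1.0 if i == j else -1.0
            total += log_sigmoid(label * logit)
    # Average over the batch size, not over the number of pairs.
    return -total / n


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    matched_texts = [[1.0, 0.0], [0.0, 1.0]]
    swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
    print("matched:", pairwise_sigmoid_loss(images, matched_texts))
    print("swapped:", pairwise_sigmoid_loss(images, swapped_texts))
```

Because each pair contributes its own binary term, the loss needs no softmax normalization across the batch, which is why this family of losses is less dependent on large numbers of negatives than softmax-based contrastive objectives.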
Related papers
- Vision-language Modelling For Radiological Imaging And Reports In The Low Data Regime (2023) 0.00
- Learning To Read Where To Look: Disease-aware Vision-language Pretraining For 3D CT (2026) 0.00
- Exploring The Capabilities Of LLM Encoders For Image-text Retrieval In Chest X-rays (2025) 0.00
- Efficient Medical Vision-language Alignment Through Adapting Masked Vision Models (2025) 5.74
- LVLM-aware Multimodal Retrieval For RAG-based Medical Diagnosis With General-purpose Models (2025) 0.00
- PaLI-3 Vision Language Models: Smaller, Faster, Stronger (2023) 0.00
- ViLLA: Fine-grained Vision-language Representation Learning From Real-world Data (2023) 8.82
- BIMCV-R: A Landmark Dataset For 3D CT Text-image Retrieval (2024) 8.09