LVLM-Aware Multimodal Retrieval for RAG-Based Medical Diagnosis with General-Purpose Models
2025 · Nir Mazor, Tom Hope
Abstract
Retrieving visual and textual information from medical literature and hospital records can enhance diagnostic accuracy for clinical image interpretation. However, multimodal retrieval-augmented diagnosis is highly challenging. We explore a lightweight mechanism for enhancing diagnostic performance of retrieval-augmented LVLMs. We train a lightweight LVLM-aware multimodal retriever, such that the retriever learns to return images and texts that guide the LVLM toward correct predictions. In our low-resource setting, we perform only lightweight fine-tuning with small amounts of data, and use only general-purpose backbone models, achieving competitive results in clinical classification and VQA tasks compared to medically pre-trained models with extensive training. In a novel analysis, we highlight a previously unexplored class of errors that we term inconsistent retrieval predictions: cases where different top-retrieved images yield different predictions for the same target. We find that t