RAMM: Retrieval-augmented Biomedical Visual Question Answering With Multi-modal Pre-training
2023 · Zheng Yuan, Qiao Jin, Chuanqi Tan, et al.
Abstract
Vision-and-language multi-modal pretraining and fine-tuning have shown great success in visual question answering (VQA). Compared to general-domain VQA, the performance of biomedical VQA suffers from limited data. In this paper, we propose a retrieval-augmented pretrain-and-finetune paradigm named RAMM for biomedical VQA to overcome the data limitation issue. Specifically, we collect a new biomedical dataset named PMCPM, which offers patient-based image-text pairs containing diverse patient situations from PubMed. Then, we pretrain the biomedical multi-modal model to learn visual and textual representations for image-text pairs and align these representations with an image-text contrastive (ITC) objective. Finally, we propose a retrieval-augmented method to better use the limited data. We propose to retrieve similar image-text pairs based on ITC from pretraining datasets and introduce a novel retrieval-attention module to fuse the representation of the image and the question with the retrie
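The two ITC-related steps named in the abstract (aligning image and text representations with a contrastive objective, then retrieving similar pairs by ITC similarity) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the temperature of 0.07, the use of cosine similarity as the ITC score, and the toy two-dimensional embeddings are all assumptions made for illustration, and the retrieval-attention module is not covered.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a, b):
    """Cosine similarity, used here as an assumed stand-in for the ITC score."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def itc_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch where pair k is (image k, text k).

    The temperature value is an illustrative assumption.
    """
    n = len(image_embs)
    sims = [[cosine(i, t) / temperature for t in text_embs] for i in image_embs]

    def cross_entropy(rows):
        # Softmax cross-entropy with the matching pair on the diagonal.
        total = 0.0
        for k, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(s - m) for s in row))
            total += log_z - row[k]
        return total / n

    cols = [list(c) for c in zip(*sims)]  # text-to-image direction
    return 0.5 * (cross_entropy(sims) + cross_entropy(cols))


def retrieve_top_k(query_emb, bank_embs, k=2):
    """Return indices of the k bank entries most similar to the query."""
    scored = [(cosine(query_emb, e), idx) for idx, e in enumerate(bank_embs)]
    scored.sort(key=lambda p: (-p[0], p[1]))  # highest score first, ties by index
    return [idx for _, idx in scored[:k]]


# Toy usage with hypothetical embeddings.
bank = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
neighbours = retrieve_top_k([1.0, 0.0], bank, k=2)
aligned = itc_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = itc_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setup the loss is small when each image embedding matches its own text embedding and large when the pairs are swapped, which is the behaviour a contrastive alignment objective is meant to reward.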
Related papers
- LVLM-aware Multimodal Retrieval For RAG-based Medical Diagnosis With General-purpose Models (2025) 0.00
- Fine-grained Late-interaction Multi-modal Retrieval For Retrieval Augmented Visual Question Answering (2023) 5.24
- VQA4CIR: Boosting Composed Image Retrieval With Visual Question Answering (2023) 5.24
- Benchmarking Vision-language Contrastive Methods For Medical Representation Learning (2024) 0.00
- Cross-modal Retrieval Augmentation For Multi-modal Classification (2021) 9.23
- Vision-language Modelling For Radiological Imaging And Reports In The Low Data Regime (2023) 0.00
- A Systematic Study Of Retrieval Pipeline Design For Retrieval-augmented Medical Question Answering (2026) 0.00
- MLLMs-augmented Visual-language Representation Learning (2023) 0.00