On The Importance Of Text Preprocessing For Multimodal Representation Learning And Pathology Report Generation
2025 · Ruben T. Lucassen, Tijn van de Luijtgaarden, Sander P. J. Moonemans, et al.
Abstract
Vision-language models in pathology enable multimodal case retrieval and automated report generation. Many of the models developed so far, however, have been trained on pathology reports that include information which cannot be inferred from paired whole slide images (e.g., patient history), potentially leading to hallucinated sentences in generated reports. To this end, we investigate how the selection of information from pathology reports for vision-language modeling affects the quality of the multimodal representations and generated reports. More concretely, we compare a model trained on full reports against a model trained on preprocessed reports that only include sentences describing the cell and tissue appearances based on the H&E-stained slides. For the experiments, we built upon the BLIP-2 framework and used a cutaneous melanocytic lesion dataset of 42,433 H&E-stained whole slide images and 19,636 corresponding pathology reports. Model performance was assessed using image-to-te
Related papers
- PathAlign: A Vision-language Model For Whole Slide Images In Histopathology (2024)
- Multimodal Whole Slide Foundation Model For Pathology (2024)
- Accurate And Scalable Multimodal Pathology Retrieval Via Attentive Vision-language Alignment (2025)
- Vision-language Modelling For Radiological Imaging And Reports In The Low Data Regime (2023)
- Towards A Text-based Quantitative And Explainable Histopathology Image Analysis (2024)
- Learning To Read Where To Look: Disease-aware Vision-language Pretraining For 3D CT (2026)
- MedProbCLIP: Probabilistic Adaptation Of Vision-language Foundation Model For Reliable Radiograph-report Retrieval (2026)
- LVLM-aware Multimodal Retrieval For RAG-based Medical Diagnosis With General-purpose Models (2025)