Villa: Fine-grained Vision-language Representation Learning From Real-world Data
2023 · Maya Varma, Jean-Benoit Delbrouck, Sarah Hooper, et al.
Abstract
Vision-language models (VLMs), such as CLIP and ALIGN, are generally trained on datasets consisting of image-caption pairs obtained from the web. However, real-world multimodal datasets, such as healthcare data, are significantly more complex: each image (e.g. X-ray) is often paired with text (e.g. physician report) that describes many distinct attributes occurring in fine-grained regions of the image. We refer to these samples as exhibiting high pairwise complexity, since each image-text pair can be decomposed into a large number of region-attribute pairings. The extent to which VLMs can capture fine-grained relationships between image regions and textual attributes when trained on such data has not been previously evaluated. The first key contribution of this work is to demonstrate through systematic evaluations that as the pairwise complexity of the training dataset increases, standard VLMs struggle to learn region-attribute relationships, exhibiting performance degradations of up t
Related papers
- Vision-language Modelling For Radiological Imaging And Reports In The Low Data Regime (2023)
- Large-scale Adversarial Training For Vision-and-language Representation Learning (2020)
- FLAIR: VLM With Fine-grained Language-informed Image Representations (2024)
- Linear Alignment Of Vision-language Models For Image Captioning (2023)
- Learning The Visualness Of Text Using Large Vision-language Models (2023)
- Advancing Myopia To Holism: Fully Contrastive Language-image Pre-training (2024)
- Contrasting Intra-modal And Ranking Cross-modal Hard Negatives To Enhance Visio-linguistic Compositional Understanding (2023)
- Efficient Medical Vision-language Alignment Through Adapting Masked Vision Models (2025)