Adding Simple Structure At Inference Improves Vision-language Compositionality
2025 · Imanol Miranda, Ander Salaberria, Eneko Agirre, et al.
Abstract
Dual-encoder Vision-Language Models (VLMs) such as CLIP are widely used for image-text retrieval tasks. However, these models struggle with compositionality, showing a bag-of-words-like behavior that limits their retrieval performance. Many different training approaches have been proposed to improve the vision-language compositionality of these models. In comparison, inference-time techniques have received little attention. In this paper, we propose to add simple structure at inference, where, given an image and a caption: i) we divide the image into smaller crops, ii) we extract text segments, capturing objects, attributes and relations, iii) using a VLM, we find the image crops that best align with the text segments, obtaining matches, and iv) we compute the final image-text similarity by aggregating the individual similarities of the matches. Based on various popular dual-encoder VLMs, we evaluate our approach in controlled and natural datasets for VL compositionalit
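The four steps listed in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the grid-based cropping, the best-crop-per-segment matching rule, and the mean used for aggregation are all assumptions made for illustration, and the function names (`grid_crops`, `match_segments`, `structured_similarity`) are hypothetical. Step ii (extracting text segments) and the VLM encoders are taken as given, so crops and segments appear here as plain embedding vectors.

```python
from math import sqrt
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def grid_crops(width: int, height: int, rows: int, cols: int) -> List[Box]:
    """Step i (assumed strategy): split an image into a rows x cols grid of boxes."""
    boxes = []
    for r in range(rows):
        for c in range(cols):
            left, right = c * width // cols, (c + 1) * width // cols
            top, bottom = r * height // rows, (r + 1) * height // rows
            boxes.append((left, top, right, bottom))
    return boxes


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match_segments(
    crop_embs: Sequence[Vector], segment_embs: Sequence[Vector]
) -> List[Tuple[int, int, float]]:
    """Step iii (assumed rule): pair each text segment with its most similar crop.

    Returns (segment_index, crop_index, similarity) triples.
    """
    matches = []
    for s_idx, seg in enumerate(segment_embs):
        sims = [cosine(seg, crop) for crop in crop_embs]
        best = max(range(len(sims)), key=sims.__getitem__)
        matches.append((s_idx, best, sims[best]))
    return matches


def structured_similarity(
    crop_embs: Sequence[Vector], segment_embs: Sequence[Vector]
) -> float:
    """Step iv (assumed aggregation): mean similarity over all matches."""
    if not crop_embs or not segment_embs:
        raise ValueError("need at least one crop and one text segment")
    matches = match_segments(crop_embs, segment_embs)
    return sum(sim for _, _, sim in matches) / len(matches)


# Toy embeddings standing in for VLM outputs (hypothetical values).
crops = [[3.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
segments = [[1.0, 0.0], [0.0, 2.0]]
score = structured_similarity(crops, segments)
```

In practice the embeddings would come from the image and text towers of a dual-encoder VLM, with each crop passed through the image encoder and each segment through the text encoder. Other aggregation choices (for example a sum or a weighted mean) would slot into `structured_similarity` without changing the matching step.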
Related papers
- Linear Spaces Of Meanings: Compositional Structures In Vision-language Models (2023) · 9.41
- Learning Visual Composition Through Improved Semantic Guidance (2024) · 0.00
- Advancing Compositional Awareness In CLIP With Efficient Fine-tuning (2025) · 0.00
- Prompting Large Vision-language Models For Compositional Reasoning (2024) · 0.00
- Contrasting Intra-modal And Ranking Cross-modal Hard Negatives To Enhance Visio-linguistic Compositional Understanding (2023) · 12.11
- Semantic Compositions Enhance Vision-language Contrastive Learning (2024) · 0.00
- Compositional Image-text Matching And Retrieval By Grounding Entities (2025) · 0.60
- Leveraging Large Vision-language Model As User Intent-aware Encoder For Composed Image Retrieval (2024) · 3.58