Compositional Image-text Matching And Retrieval By Grounding Entities
2025 · Madhukar Reddy Vongala, Saurabh Srivastava, Jana Košecká
Abstract
Vision-language pretraining on large datasets of image-text pairs is one of the main building blocks of current Vision-Language Models. With additional training, these models excel in various downstream tasks, including visual question answering, image captioning, and visual commonsense reasoning. However, a notable weakness of pretrained models like CLIP is their inability to perform entity grounding and compositional image and text matching~\cite{Jiang2024ComCLIP, yang2023amc, Rajabi2023GroundedVSR, learninglocalizeCVPR24}. In this work we propose a novel learning-free zero-shot augmentation of CLIP embeddings that has favorable compositional properties. We compute separate embeddings of sub-images of object entities and relations that are localized by state-of-the-art open-vocabulary detectors and dynamically adjust the baseline global image embedding. The final embedding is obtained by computing a weighted combination of the sub-image embeddings. The resulting embed
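The idea of adjusting a global image embedding with embeddings of detected sub-images can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration and is not taken from the paper: the function name `augment_image_embedding`, the softmax weighting of sub-images by their similarity to the text, the mixing factor `alpha`, and the `temperature`. Random vectors stand in for CLIP embeddings, and the detector that would produce the sub-image crops is not modeled.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along the given axis."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def augment_image_embedding(global_emb, sub_embs, text_emb,
                            alpha=0.5, temperature=0.1):
    """Blend a global image embedding with weighted sub-image embeddings.

    Hypothetical scheme (an assumption, not the paper's formula):
    sub-images are weighted by a softmax over their cosine similarity
    to the text embedding, pooled, and mixed with the global embedding.
    """
    global_emb = l2_normalize(np.asarray(global_emb, dtype=float))
    text_emb = l2_normalize(np.asarray(text_emb, dtype=float))
    if len(sub_embs) == 0:
        # No detected entities: fall back to the global embedding.
        return global_emb

    sub_embs = l2_normalize(np.asarray(sub_embs, dtype=float))
    sims = sub_embs @ text_emb                 # cosine similarity per sub-image
    logits = sims / temperature
    logits = logits - logits.max()             # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum()          # softmax weights, sum to 1
    pooled = weights @ sub_embs                # weighted combination

    return l2_normalize((1.0 - alpha) * global_emb + alpha * pooled)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim = 512                                  # stand-in embedding size
    global_emb = rng.normal(size=dim)          # placeholder for a full-image embedding
    sub_embs = rng.normal(size=(3, dim))       # placeholders for entity/relation crops
    text_emb = rng.normal(size=dim)            # placeholder for a caption embedding

    augmented = augment_image_embedding(global_emb, sub_embs, text_emb)
    score = float(augmented @ l2_normalize(text_emb))
    print("image-text matching score:", score)
```

Because the procedure only recombines embeddings that already exist, it needs no training, which matches the learning-free, zero-shot framing of the abstract. In a real pipeline the placeholders would be replaced by embeddings of the full image, of detector-localized crops, and of the caption.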
Related papers
- Semantic Compositions Enhance Vision-language Contrastive Learning (2024)
- Prompting Large Vision-language Models For Compositional Reasoning (2024)
- Composed Image Retrieval Using Contrastive Learning And Task-oriented CLIP-based Features (2023)
- LightCLIP: Learning Multi-level Interaction For Lightweight Vision-language Models (2023)
- Advancing Compositional Awareness In CLIP With Efficient Fine-tuning (2025)
- ContextCLIP: Contextual Alignment Of Image-text Pairs On CLIP Visual Representations (2022)
- Finetuning CLIP To Reason About Pairwise Differences (2024)
- Contrasting Intra-modal And Ranking Cross-modal Hard Negatives To Enhance Visio-linguistic Compositional Understanding (2023)