Prompting Large Vision-language Models For Compositional Reasoning
2024 · Timothy Ossowski, Ming Jiang, Junjie Hu
Abstract
Vision-language models such as CLIP have shown impressive capabilities in encoding texts and images into aligned embeddings, enabling the retrieval of multimodal data in a shared embedding space. However, these embedding-based models still face challenges in effectively matching images and texts with similar visio-linguistic compositionality, as evidenced by their performance on the recent Winoground dataset. In this paper, we argue that this limitation stems from two factors: the use of single vector representations for complex multimodal data, and the absence of step-by-step reasoning in these embedding-based methods. To address this issue, we make an exploratory step using a novel generative method that prompts large vision-language models (e.g., GPT-4) to depict images and perform compositional reasoning. Our method outperforms other embedding-based methods on the Winoground dataset, and obtains further improvement of up to 10% accuracy when enhanced with the optimal description.
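For readers unfamiliar with how Winoground performance is usually reported, here is a minimal sketch of the benchmark's standard per-example metrics. The abstract does not spell these out; the definitions below follow the Winoground benchmark itself. Each example has two captions and two images, and the function name `winoground_scores` and the 2x2 score layout `s[caption][image]` are illustrative assumptions rather than anything from this paper.

```python
from typing import Dict, Sequence


def winoground_scores(s: Sequence[Sequence[float]]) -> Dict[str, bool]:
    """Per-example Winoground metrics from a 2x2 matching-score matrix.

    Assumed layout (illustrative): s[c][i] is the model's match score for
    caption c and image i, where (caption 0, image 0) and
    (caption 1, image 1) are the correct pairs.
    """
    # Text score: for each image, the correct caption must score higher.
    text_ok = s[0][0] > s[1][0] and s[1][1] > s[0][1]
    # Image score: for each caption, the correct image must score higher.
    image_ok = s[0][0] > s[0][1] and s[1][1] > s[1][0]
    # Group score: both conditions must hold on the same example.
    return {"text": text_ok, "image": image_ok, "group": text_ok and image_ok}


if __name__ == "__main__":
    # Hypothetical scores, not results from the paper.
    print(winoground_scores([[0.9, 0.2], [0.3, 0.8]]))
```

For an embedding-based model the scores would typically be image-text similarities; for a generative model they could come from whatever match judgment the prompted model produces. Dataset-level accuracy is the fraction of examples for which each flag is true.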
Related papers
- Contrasting Intra-modal And Ranking Cross-modal Hard Negatives To Enhance Visio-linguistic Compositional Understanding (2023)
- Learning Visual Composition Through Improved Semantic Guidance (2024)
- Linear Spaces Of Meanings: Compositional Structures In Vision-language Models (2023)
- Compositional Image-text Matching And Retrieval By Grounding Entities (2025)
- Advancing Compositional Awareness In CLIP With Efficient Fine-tuning (2025)
- Adding Simple Structure At Inference Improves Vision-language Compositionality (2025)
- COLA: A Benchmark For Compositional Text-to-image Retrieval (2023)
- Why Is Winoground Hard? Investigating Failures In Visuolinguistic Compositionality (2022)