COLA: A Benchmark For Compositional Text-to-image Retrieval
2023 · Arijit Ray, Filip Radenovic, Abhimanyu Dubey, et al.
Abstract
Compositional reasoning is a hallmark of human visual intelligence. Yet, despite the size of large vision-language models, they struggle to represent simple compositions by combining objects with their attributes. To measure this lack of compositional capability, we design Cola, a text-to-image retrieval benchmark to Compose Objects Localized with Attributes. To solve Cola, a model must retrieve images with the correct configuration of attributes and objects and avoid choosing a distractor image with the same objects and attributes but in the wrong configuration. Cola contains about 1.2k composed queries of 168 objects and 197 attributes on around 30K images. Our human evaluation finds that Cola is 83.33% accurate, similar to contemporary compositionality benchmarks. Using Cola as a testbed, we explore empirical modeling designs to adapt pre-trained vision-language models to reason compositionally. We explore 6 adaptation strategies on 2 seminal vision-language models, using compositio
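The retrieval criterion described in the abstract (prefer the image with the correct attribute-object configuration over a distractor with the same attributes and objects in the wrong configuration) can be illustrated with a small sketch. This is not the benchmark's official metric or code: the function name `retrieval_accuracy` and the similarity scores below are hypothetical, and the sketch assumes a model has already produced one score for the correct image and one for the distractor per query.

```python
from typing import Sequence


def retrieval_accuracy(
    correct_scores: Sequence[float],
    distractor_scores: Sequence[float],
) -> float:
    """Fraction of queries whose correct image scores strictly above its distractor.

    Hypothetical helper for illustration; not taken from the Cola paper.
    """
    if len(correct_scores) != len(distractor_scores):
        raise ValueError("Each query needs one correct score and one distractor score.")
    if not correct_scores:
        raise ValueError("At least one query is required.")
    # A query counts as solved only if the correct image wins outright (ties fail).
    hits = sum(c > d for c, d in zip(correct_scores, distractor_scores))
    return hits / len(correct_scores)


# Made-up similarity scores for three queries (assumed, not real model output).
correct = [0.82, 0.40, 0.67]
distractor = [0.75, 0.55, 0.10]
accuracy = retrieval_accuracy(correct, distractor)
print(f"accuracy = {accuracy:.3f}")
```

Counting ties as failures is a design choice in this sketch: a model that cannot separate the two configurations gets no credit for that query.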
Related papers
- Composed Object Retrieval: Object-level Retrieval Via Composed Expressions (2025) · 1.91
- CIR-CoT: Towards Interpretable Composed Image Retrieval Via End-to-end Chain-of-thought Reasoning (2025) · 0.00
- Composed Image Retrieval Using Contrastive Learning And Task-oriented CLIP-based Features (2023) · 16.84
- Learning Visual Composition Through Improved Semantic Guidance (2024) · 0.00
- Prompting Large Vision-language Models For Compositional Reasoning (2024) · 0.00
- Advancing Compositional Awareness In CLIP With Efficient Fine-tuning (2025) · 0.00
- ICC++: Explainable Image Retrieval For Art Historical Corpora Using Image Composition Canvas (2022) · 0.00
- Image Retrieval On Real-life Images With Pre-trained Vision-and-language Models (2021) · 17.07