BiVLC: Extending Vision-language Compositionality Evaluation With Text-to-image Retrieval
2024 · Imanol Miranda, Ander Salaberria, Eneko Agirre, et al.
Abstract
Existing Vision-Language Compositionality (VLC) benchmarks like SugarCrepe are formulated as image-to-text retrieval problems, where, given an image, the models need to select between the correct textual description and a synthetic hard negative text. In this work, we present the Bidirectional Vision-Language Compositionality (BiVLC) dataset. The novelty of BiVLC is to add a synthetic hard negative image generated from the synthetic text, resulting in two image-to-text retrieval examples (one for each image) and, more importantly, two text-to-image retrieval examples (one for each text). Human annotators filter out ill-formed examples ensuring the validity of the benchmark. The experiments on BiVLC uncover a weakness of current multimodal models, as they perform poorly in the text-to-image direction. In fact, when considering both retrieval directions, the conclusions obtained in previous works change significantly. In addition to the benchmark, we show that a contrastive model trained
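The two-image, two-text layout described in the abstract can be sketched as a scoring routine. This is a minimal illustration, not the authors' evaluation code: it assumes each instance is summarized by a 2x2 matrix of model similarity scores, and the names `score_instance` and `direction_accuracy` are hypothetical.

```python
from typing import Dict, List, Sequence

Matrix = Sequence[Sequence[float]]


def score_instance(sim: Matrix) -> Dict[str, List[bool]]:
    """Score one instance from a 2x2 similarity matrix (assumed layout).

    sim[i][t] is the similarity between image i and text t, where
    index 0 is the original item and index 1 is the synthetic hard negative.
    """
    # Image-to-text: for each image, its own text must outscore the other text.
    i2t = [sim[i][i] > sim[i][1 - i] for i in (0, 1)]
    # Text-to-image: for each text, its own image must outscore the other image.
    t2i = [sim[t][t] > sim[1 - t][t] for t in (0, 1)]
    return {"i2t": i2t, "t2i": t2i}


def direction_accuracy(instances: Sequence[Matrix]) -> Dict[str, float]:
    """Average accuracy per retrieval direction over all instances."""
    if not instances:
        raise ValueError("need at least one instance")
    results = [score_instance(sim) for sim in instances]
    total = 2 * len(results)  # two retrieval examples per direction per instance
    return {
        "i2t": sum(sum(r["i2t"]) for r in results) / total,
        "t2i": sum(sum(r["t2i"]) for r in results) / total,
    }
```

Each instance therefore contributes four retrieval examples, two per direction, which is why a model can score well on image-to-text retrieval while still failing in the text-to-image direction.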
Related papers
- Contrasting Intra-modal And Ranking Cross-modal Hard Negatives To Enhance Visio-linguistic Compositional Understanding (2023) · 12.11
- TripletCLIP: Improving Compositional Reasoning Of CLIP Via Synthetic Vision-language Negatives (2024) · 4.52
- Learning The Visualness Of Text Using Large Vision-language Models (2023) · 4.52
- Contrastive Vision-language Learning With Paraphrasing And Negation (2025) · 0.00
- CBVS: A Large-scale Chinese Image-text Benchmark For Real-world Short Video Search Scenarios (2024) · 2.29
- Advancing Compositional Awareness In CLIP With Efficient Fine-tuning (2025) · 0.00
- Exploring A Unified Vision-centric Contrastive Alternatives On Multi-modal Web Documents (2025) · 1.69
- Semantic Compositions Enhance Vision-language Contrastive Learning (2024) · 0.00