Toward Automatic Relevance Judgment Using Vision-Language Models for Image-Text Retrieval Evaluation
2024 · Jheng-Hong Yang, Jimmy Lin
Abstract
Vision-Language Models (VLMs) have demonstrated success across diverse applications, yet their potential to assist in relevance judgments remains uncertain. This paper assesses the relevance estimation capabilities of VLMs, including CLIP, LLaVA, and GPT-4V, within a large-scale ad hoc retrieval task tailored for multimedia content creation in a zero-shot fashion. Preliminary experiments reveal the following: (1) Both LLaVA and GPT-4V, encompassing open-source and closed-source visual-instruction-tuned Large Language Models (LLMs), achieve notable Kendall's \(\tau \sim 0.4\) when compared to human relevance judgments, surpassing the CLIPScore metric. (2) While CLIPScore is strongly preferred, LLMs are less biased towards CLIP-based retrieval systems. (3) GPT-4V's score distribution aligns more closely with human judgments than other models, achieving a Cohen's \(\kappa\) value of around 0.08, which outperforms CLIPScore at approximately \(-0.096\). These findings underscore the
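The abstract reports agreement with human judgments through two statistics, Kendall's \(\tau\) and Cohen's \(\kappa\). As a reading aid, here is a minimal sketch of how each could be computed. It makes several assumptions that do not come from the paper: the tie-aware \(\tau_b\) variant, unweighted \(\kappa\), the function names `kendall_tau_b` and `cohen_kappa`, and the toy labels, which are invented for illustration. The paper's own evaluation pipeline may differ.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length score lists (O(n^2) pair scan)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = tied_x = tied_y = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        dx, dy = xi - xj, yi - yj
        if dx == 0:
            tied_x += 1          # pair is tied in x
        if dy == 0:
            tied_y += 1          # pair is tied in y
        if dx * dy > 0:
            concordant += 1      # both lists order the pair the same way
        elif dx * dy < 0:
            discordant += 1      # the lists disagree on the pair's order
    n_pairs = len(x) * (len(x) - 1) // 2
    denom = sqrt((n_pairs - tied_x) * (n_pairs - tied_y))
    if denom == 0:
        return float("nan")      # undefined when either list is constant
    return (concordant - discordant) / denom


def cohen_kappa(a, b):
    """Unweighted Cohen's kappa between two equal-length label lists."""
    if len(a) != len(b) or not a:
        raise ValueError("a and b must be non-empty and of equal length")
    n = len(a)
    observed = sum(1 for u, v in zip(a, b) if u == v) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: product of each rater's marginal label frequencies.
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1:
        return float("nan")      # undefined when both raters use a single label
    return (observed - expected) / (1 - expected)


# Hypothetical graded relevance labels (0 = not relevant, 2 = highly relevant).
human_labels = [0, 1, 2, 2, 1, 0, 2, 0]
model_labels = [0, 1, 2, 1, 1, 0, 2, 1]

print("Kendall tau-b:", kendall_tau_b(human_labels, model_labels))
print("Cohen kappa:  ", cohen_kappa(human_labels, model_labels))
```

The two statistics measure different things. \(\tau\) compares only the relative ordering of the items, while \(\kappa\) compares exact label matches corrected for chance, so a model can rank items well and still score a low \(\kappa\) if its label distribution is shifted.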
Related papers
- A Little More Like This: Text-to-Image Retrieval with Vision-Language Models Using Relevance Feedback (2025) · 0.00
- An Analysis of Vision-Language Models for Fabric Retrieval (2025) · 0.00
- Enhancing Interactive Image Retrieval with Query Rewriting Using Large Language Models and Vision Language Models (2024) · 8.82
- Learning the Visualness of Text Using Large Vision-Language Models (2023) · 4.52
- ImageRef-VL: Enabling Contextual Image Referencing in Vision-Language Models (2025) · 1.91
- Aligning Vision Models with Human Aesthetics in Retrieval: Benchmarks and Algorithms (2024) · 0.00
- A Multimodal Recaptioning Framework to Account for Perceptual Diversity Across Languages in Vision-Language Modeling (2025) · 0.00
- Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding (2023) · 12.11