VL-Taboo: An Analysis Of Attribute-based Zero-shot Capabilities Of Vision-language Models
2022 · Felix Vogel, Nina Shvetsova, Leonid Karlinsky, et al.
Abstract
Vision-language models trained on large, randomly collected data have had a significant impact in many areas since they appeared. Although they perform well in various fields, such as image-text retrieval, their inner workings are still not fully understood. The current work analyses the true zero-shot capabilities of these models. We start with an analysis of the training corpus, assessing to what extent (and which of) the test classes are really zero-shot and how this correlates with per-class performance. We follow up with an analysis of the attribute-based zero-shot learning capabilities of these models, evaluating how well this classical zero-shot notion emerges from large-scale webly supervision. We leverage the recently released LAION400M data corpus as well as the publicly available pretrained models of CLIP, OpenCLIP, and FLAVA, evaluating the attribute-based zero-shot capabilities on the CUB and AWA2 benchmarks. Our analysis shows that: (i) most of the classes in popul
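The first step described in the abstract, checking which test classes actually occur in the training corpus, can be illustrated with a minimal sketch. It assumes the check is a plain whole-word match of each class name against caption text; the helper names `count_class_mentions` and `unseen_classes`, the toy captions, and the class list are all hypothetical and are not taken from the paper or from LAION400M.

```python
import re
from collections import Counter


def count_class_mentions(captions, class_names):
    """Count how many captions mention each class name (case-insensitive).

    Assumption: a "mention" is an exact whole-word or whole-phrase match.
    Plurals and synonyms are NOT matched, so the counts are a lower bound.
    """
    patterns = {
        name: re.compile(r"\b" + re.escape(name.lower()) + r"\b")
        for name in class_names
    }
    # Start every class at zero so unseen classes remain visible in the result.
    counts = Counter({name: 0 for name in class_names})
    for caption in captions:
        text = caption.lower()
        for name, pattern in patterns.items():
            if pattern.search(text):
                counts[name] += 1
    return counts


def unseen_classes(counts):
    """Return the class names that never appear in any caption."""
    return sorted(name for name, n in counts.items() if n == 0)


# Toy data, made up for illustration only.
captions = [
    "A zebra grazing on the savanna",
    "Photo of a humpback whale breaching",
    "Two zebras and a giraffe at the zoo",
]
class_names = ["zebra", "humpback whale", "pallas cat"]

counts = count_class_mentions(captions, class_names)
print(dict(counts))
print(unseen_classes(counts))
```

Under this definition, only classes with a count of zero would qualify as strictly zero-shot. The per-class counts could then be compared against per-class accuracy, for example with a rank correlation, to probe the relationship the abstract mentions.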
Related papers
- Toward Automatic Relevance Judgment Using Vision-language Models For Image-text Retrieval Evaluation (2024) 0.00
- Babel-ImageNet: Massively Multilingual Evaluation Of Vision-and-language Representations (2023) 2.76
- Towards Zero-shot Cross-lingual Image Retrieval And Tagging (2021) 2.46
- Face Recognition In The Age Of CLIP & Billion Image Datasets (2023) 0.00
- Towards Zero-shot Cross-lingual Image Retrieval (2020) 2.46
- Distilling Vision-language Models On Millions Of Videos (2024) 7.50
- An Analysis Of Vision-language Models For Fabric Retrieval (2025) 0.00
- A Recipe For Improving Remote Sensing VLM Zero Shot Generalization (2025) 0.00