Babel-ImageNet: Massively Multilingual Evaluation of Vision-and-Language Representations
2023 · Gregor Geigle, Radu Timofte, Goran Glavaš
Abstract
Vision-and-language (VL) models with separate encoders for each modality (e.g., CLIP) have become the go-to models for zero-shot image classification and image-text retrieval. They are, however, mostly evaluated in English as multilingual benchmarks are limited in availability. We introduce Babel-ImageNet, a massively multilingual benchmark that offers (partial) translations of ImageNet labels to 100 languages, built without machine translation or manual annotation. We instead automatically obtain reliable translations by linking them -- via shared WordNet synsets -- to BabelNet, a massively multilingual lexico-semantic network. We evaluate 11 public multilingual CLIP models on zero-shot image classification (ZS-IC) on our benchmark, demonstrating a significant gap between English ImageNet performance and that of high-resource languages (e.g., German or Chinese), and an even bigger gap for low-resource languages (e.g., Sinhala or Lao). Crucially, we show that the models' ZS-IC performa
Related papers
- uCLIP: Parameter-Efficient Multilingual Extension of Vision-Language Models with Unpaired Data (2025) · 0.00
- M2-Encoder: Advancing Bilingual Image-Text Understanding by Large-Scale Efficient Pretraining (2024) · 0.00
- NLLB-CLIP -- Train Performant Multilingual Image Retrieval Model on a Budget (2023) · 0.00
- Multilingual Diversity Improves Vision-Language Representations (2024) · 2.26
- Towards Zero-Shot Cross-Lingual Image Retrieval and Tagging (2021) · 2.46
- Towards Zero-Shot Cross-Lingual Image Retrieval (2020) · 2.46
- LowCLIP: Adapting the CLIP Model Architecture for Low-Resource Languages in Multimodal Image Retrieval Task (2024) · 0.00
- VL-Taboo: An Analysis of Attribute-Based Zero-Shot Capabilities of Vision-Language Models (2022) · 0.00