WIT: Wikipedia-based Image Text Dataset For Multimodal Multilingual Machine Learning
2021 · Krishna Srinivasan, Karthik Raman, Jiecao Chen, et al.
Abstract
The milestone improvements brought about by deep representation learning and pre-training techniques have led to large performance gains across downstream NLP, IR and Vision tasks. Multimodal modeling techniques aim to leverage large, high-quality visio-linguistic datasets for learning complementary information (across image and text modalities). In this paper, we introduce the Wikipedia-based Image Text (WIT) Dataset (https://github.com/google-research-datasets/wit) to better facilitate multimodal, multilingual learning. WIT is composed of a curated set of 37.6 million entity-rich image-text examples with 11.5 million unique images across 108 Wikipedia languages. Its size enables WIT to be used as a pretraining dataset for multimodal models, as we show when applied to downstream tasks such as image-text retrieval. WIT has four main and unique advantages. First, WIT is the largest multimodal dataset by the number of image-text examples by 3x (at the time of writing). Second, WIT is mass
Related papers
- Multilingual Diversity Improves Vision-language Representations (2024)
- Entity Image And Mixed-modal Image Retrieval Datasets (2025)
- Mr. Right: Multimodal Retrieval On Representation Of Image With Text (2022)
- ImageBERT: Cross-modal Pre-training With Large-scale Weak-supervised Image-text Data (2020)
- WikiMuTe: A Web-sourced Dataset Of Semantic Descriptions For Music Audio (2023)
- AToMiC: An Image/text Retrieval Test Collection To Support Multimedia Content Creation (2023)
- Image Search Using Multilingual Texts: A Cross-modal Learning Approach Between Image And Text (2019)
- Self-supervised Visual Representations For Cross-modal Retrieval (2019)