Entity Image And Mixed-modal Image Retrieval Datasets
2025 · Cristian-Ioan Blaga, Paul Suganthan, Sahil Dua, et al.
Abstract
Despite advances in multimodal learning, challenging benchmarks for mixed-modal image retrieval that combines visual and textual information are lacking. This paper introduces a novel benchmark to rigorously evaluate image retrieval that demands deep cross-modal contextual understanding. We present two new datasets: the Entity Image Dataset (EI), providing canonical images for Wikipedia entities, and the Mixed-Modal Image Retrieval Dataset (MMIR), derived from the WIT dataset. The MMIR benchmark features two challenging query types requiring models to ground textual descriptions in the context of provided visual entities: single entity-image queries (one entity image with descriptive text) and multi-entity-image queries (multiple entity images with relational text). We empirically validate the benchmark's utility as both a training corpus and an evaluation set for mixed-modal retrieval. The quality of both datasets is further affirmed through crowd-sourced human annotations. The datase
Related papers
- EDIS: Entity-driven Image Search Over Multimodal Web Content (2023) · 6.77
- DocMMIR: A Framework For Document Multi-modal Information Retrieval (2025) · 3.46
- Dynamic Weighted Combiner For Mixed-modal Image Retrieval (2023) · 10.38
- IDMR: Towards Instance-driven Precise Visual Correspondence In Multimodal Retrieval (2025) · 2.29
- DeepImageSearch: Benchmarking Multimodal Agents For Context-aware Image Retrieval In Visual Histories (2026) · 0.00
- SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval (2024) · 8.07
- MM-BRIGHT: A Multi-task Multimodal Benchmark For Reasoning-intensive Retrieval (2026) · 2.60
- Mr. Right: Multimodal Retrieval On Representation Of Image With Text (2022) · 0.00