SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval
2024 · Siwei Wu, Yizhi Li, Kang Zhu, et al.
Abstract
Multi-modal information retrieval (MMIR) is a rapidly evolving field, where significant progress, particularly in image-text pairing, has been made through advanced representation learning and cross-modality alignment research. However, current benchmarks for evaluating MMIR performance in image-text pairing within the scientific domain show a notable gap, where chart and table images described in scholarly language usually do not play a significant role. To bridge this gap, we develop a specialised scientific MMIR (SciMMIR) benchmark by leveraging open-access paper collections to extract data relevant to the scientific domain. This benchmark comprises 530K meticulously curated image-text pairs, extracted from figures and tables with detailed captions in scientific documents. We further annotate the image-text pairs with two-level subset-subcategory hierarchy annotations to facilitate a more comprehensive evaluation of the baselines. We conducted zero-shot and fine-tuning evaluations o