MAIR: A Massive Benchmark For Evaluating Instructed Retrieval
2024 · Weiwei Sun, Zhengliang Shi, Jiulong Wu, et al.
Abstract
Recent information retrieval (IR) models are pre-trained and instruction-tuned on massive datasets and tasks, enabling them to perform well on a wide range of tasks and potentially generalize to unseen tasks with instructions. However, existing IR benchmarks focus on a limited scope of tasks, making them insufficient for evaluating the latest IR models. In this paper, we propose MAIR (Massive Instructed Retrieval Benchmark), a heterogeneous IR benchmark that includes 126 distinct IR tasks across 6 domains, collected from existing datasets. We benchmark state-of-the-art instruction-tuned text embedding models and re-ranking models. Our experiments reveal that instruction-tuned models generally achieve superior performance compared to non-instruction-tuned models on MAIR. Additionally, our results suggest that current instruction-tuned text embedding models and re-ranking models still lack effectiveness in specific long-tail tasks. MAIR is publicly available at https://github.com/sunnwei
Related papers
- UniIR: Training And Benchmarking Universal Multimodal Information Retrievers (2023) 10.48
- BEIR: A Heterogenous Benchmark For Zero-shot Evaluation Of Information Retrieval Models (2021) 6.67
- mFollowIR: A Multilingual Benchmark For Instruction Following In Retrieval (2025) 0.00
- Towards Better Instruction Following Retrieval Models (2025) 0.00
- Resources For Brewing BEIR: Reproducible Reference Models And An Official Leaderboard (2023) 0.00
- INQUIRE: A Natural World Text-to-image Retrieval Benchmark (2024) 5.24
- IRSC: A Zero-shot Evaluation Benchmark For Information Retrieval Through Semantic Comprehension In Retrieval-augmented Generation Scenarios (2024) 2.86
- Incompebench: A Permissively Licensed, Fine-grained Benchmark For Music Information Retrieval (2026) 0.00