MS MARCO Web Search: A Large-scale Information-rich Web Dataset With Millions Of Real Click Labels
2024 · Qi Chen, Xiubo Geng, Corby Rosset, et al.
Abstract
Recent breakthroughs in large models have highlighted the critical importance of data scale, labels, and modalities. In this paper, we introduce MS MARCO Web Search, the first large-scale, information-rich web dataset, featuring millions of real clicked query-document labels. The dataset closely mimics real-world web document and query distributions, provides rich information for a range of downstream tasks, and encourages research in areas such as generic end-to-end neural indexer models, generic embedding models, and next-generation information access systems built on large language models. MS MARCO Web Search offers a retrieval benchmark with three web retrieval challenge tasks that demand innovation in both machine learning and information retrieval systems research. As the first dataset that meets the requirements of large, real, and rich data, MS MARCO Web Search paves the way for future advances in AI and systems research. The MS MARCO Web Search dataset is available at: htt
Related papers
- MS-Shift: An Analysis Of MS MARCO Distribution Shifts On Neural Retrieval (2022) · 4.52
- The Tale Of Two MS MARCO -- And Their Unfair Comparisons (2023) · 6.34
- Investigating The Scalability Of Approximate Sparse Retrieval Algorithms To Massive Datasets (2025) · 5.84
- WebFAQ: A Multilingual Collection Of Natural Q&A Datasets For Dense Retrieval (2025) · 0.00
- Noisy Self-training With Synthetic Queries For Dense Retrieval (2023) · 0.00
- MARVEL: Unlocking The Multi-modal Capability Of Dense Retrieval Via Visual Module Plugin (2023) · 9.04
- REAL-MM-RAG: A Real-world Multi-modal Retrieval Benchmark (2025) · 4.52
- DocMMIR: A Framework For Document Multi-modal Information Retrieval (2025) · 3.46