The Tale Of Two MS MARCO -- And Their Unfair Comparisons
2023 · Carlos Lassance, Stéphane Clinchant
Abstract
The MS MARCO-passage dataset has been the main large-scale dataset open to the IR community, and it has successfully fostered the development of novel neural retrieval models over the years. It turns out, however, that two different MS MARCO corpora are used in the literature: the official one, and a second one in which passages were augmented with titles, mostly due to the introduction of the Tevatron code base. The addition of titles leaks relevance information and breaks the original guidelines of the MS MARCO-passage dataset. In this work, we investigate the differences between the two corpora and demonstrate empirically that they make a significant difference when evaluating a new method. In other words, we show that if a paper does not properly report which version is used, fairly reproducing its results is basically impossible. Furthermore, given the current status of reviewing, where monitoring state-of-the-art results is of great importance, having two differen
Related papers
- MS-Shift: An Analysis Of MS MARCO Distribution Shifts On Neural Retrieval (2022) · 4.52
- Blending Learning To Rank And Dense Representations For Efficient And Effective Cascades (2025) · 0.00
- How Does Generative Retrieval Scale To Millions Of Passages? (2023) · 10.61
- MS MARCO Web Search: A Large-scale Information-rich Web Dataset With Millions Of Real Click Labels (2024) · 11.86
- How Different Are Pre-trained Transformers For Text Ranking? (2022) · 7.81
- Overview Of The TREC 2021 Deep Learning Track (2025) · 10.85
- Evaluating Dense Passage Retrieval Using Transformers (2022) · 0.00
- How Train-test Leakage Affects Zero-shot Retrieval (2022) · 3.58