Systematic Evaluation Of Neural Retrieval Models On The Touché 2020 Argument Retrieval Subset Of BEIR
2024 · Nandan Thakur, Luiz Bonifacio, Maik Fröbe, et al.
Abstract
The zero-shot effectiveness of neural retrieval models is often evaluated on the BEIR benchmark, a combination of different IR evaluation datasets. Interestingly, previous studies found that particularly on the BEIR subset Touché 2020, an argument retrieval task, neural retrieval models are considerably less effective than BM25. Still, so far, no further investigation has been conducted on what makes argument retrieval so "special". To more deeply analyze the respective potential limits of neural retrieval models, we run a reproducibility study on the Touché 2020 data. In our study, we focus on two experiments: (i) a black-box evaluation (i.e., no model retraining), incorporating a theoretical exploration using retrieval axioms, and (ii) a data denoising evaluation involving post-hoc relevance judgments. Our black-box evaluation reveals an inherent bias of neural models towards retrieving short passages from the Touché 2020 data, and we also find that quite a few of the neural
Related papers
- BEIR: A Heterogenous Benchmark For Zero-shot Evaluation Of Information Retrieval Models (2021) · 6.67
- Resources For Brewing BEIR: Reproducible Reference Models And An Official Leaderboard (2023) · 0.00
- Can Old TREC Collections Reliably Evaluate Modern Neural Retrieval Models? (2022) · 0.00
- Diagnosing BERT With Retrieval Heuristics (2022) · 10.21
- Hard Negatives, Hard Lessons: Revisiting Training Data Quality For Robust Information Retrieval With LLMs (2025) · 2.26
- A Deep Look Into Neural Ranking Models For Information Retrieval (2019) · 17.73
- Beyond Precision: A Study On Recall Of Initial Retrieval With Neural Representations (2018) · 4.52
- ColBERTv2: Effective And Efficient Retrieval Via Lightweight Late Interaction (2021) · 17.46