DAPR: A Benchmark On Document-aware Passage Retrieval
2023 · Kexin Wang, Nils Reimers, Iryna Gurevych
Abstract
Neural retrieval work has so far focused on ranking short texts and struggles with long documents. In many cases, users want to find a relevant passage within a long document from a huge corpus, e.g. Wikipedia articles or research papers. We propose and name this task *Document-Aware Passage Retrieval* (DAPR). Analyzing the errors of state-of-the-art (SoTA) passage retrievers, we find that the major errors (53.5%) are due to missing document context. This drives us to build a benchmark for this task that includes multiple datasets from heterogeneous domains. In the experiments, we extend the SoTA passage retrievers with document context via (1) hybrid retrieval with BM25 and (2) contextualized passage representations, which inform the passage representation with document context. We find that although hybrid retrieval performs the strongest on the mixture of easy and hard queries, it completely fails on the hard queries that require document-context un
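To make the first extension in the abstract more concrete, here is a minimal sketch of hybrid retrieval that mixes passage-level scores from a neural retriever with document-level BM25 scores. The fusion rule (min-max normalization followed by a weighted sum), the `alpha` weight, the function names, and the toy scores are all illustrative assumptions; the abstract does not say how the two score sources are combined.

```python
from typing import Dict


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; an empty or constant map yields zeros."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {key: 0.0 for key in scores}
    return {key: (value - lo) / (hi - lo) for key, value in scores.items()}


def hybrid_scores(
    passage_scores: Dict[str, float],   # neural retriever score per passage
    doc_bm25_scores: Dict[str, float],  # BM25 score per parent document
    passage_to_doc: Dict[str, str],     # passage id -> parent document id
    alpha: float = 0.5,                 # hypothetical weight on the neural score
) -> Dict[str, float]:
    """Assumed fusion rule: weighted sum of normalized scores."""
    p_norm = min_max(passage_scores)
    d_norm = min_max(doc_bm25_scores)
    return {
        pid: alpha * p_norm[pid]
        + (1.0 - alpha) * d_norm.get(passage_to_doc[pid], 0.0)
        for pid in passage_scores
    }


# Toy data (made up): two passages from document d1, one from d2.
passage_scores = {"d1#p0": 0.9, "d1#p1": 0.4, "d2#p0": 0.85}
doc_bm25_scores = {"d1": 2.0, "d2": 12.0}
passage_to_doc = {"d1#p0": "d1", "d1#p1": "d1", "d2#p0": "d2"}

fused = hybrid_scores(passage_scores, doc_bm25_scores, passage_to_doc)
ranked = sorted(fused, key=fused.get, reverse=True)
print(ranked)
```

In this toy case the passage from `d2` overtakes the top-scoring passage from `d1` because its parent document matches the query much better under BM25, which is the kind of document-level signal a passage-only retriever cannot see.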
Related papers
- PARM: A Paragraph Aggregation Retrieval Model For Dense Document-to-document Retrieval (2022)
- Query-as-context Pre-training For Dense Passage Retrieval (2022)
- DAPFAM: A Domain-aware Family-level Dataset To Benchmark Cross Domain Patent Retrieval (2025)
- TempRetriever: Fusion-based Temporal Dense Passage Retrieval For Time-sensitive Questions (2025)
- A Passage-based Approach To Learning To Rank Documents (2019)
- Dense Passage Retrieval: Is It Retrieving? (2024)
- Multi-CPR: A Multi Domain Chinese Dataset For Passage Retrieval (2022)
- Synthetic Target Domain Supervision For Open Retrieval QA (2022)