Multi-CPR: A Multi Domain Chinese Dataset For Passage Retrieval
2022 · Dingkun Long, Qiong Gao, Kuan Zou, et al.
Abstract
Passage retrieval is a fundamental task in information retrieval (IR) research and has drawn much attention recently. For English, the availability of large-scale annotated datasets (e.g., MS MARCO) and the emergence of deep pre-trained language models (e.g., BERT) have substantially improved existing passage retrieval systems. For Chinese, however, and especially for specific domains, passage retrieval systems remain immature because high-quality annotated datasets are limited in scale. In this paper, we therefore present a novel multi-domain Chinese dataset for passage retrieval (Multi-CPR). The dataset is collected from three different domains: E-commerce, Entertainment video, and Medical. Each dataset contains millions of passages and a certain amount of human-annotated query-passage related pairs. We implement various representative passage retrieval methods as baselines. We find that the performance of retrieval models trained on dataset fro
Related papers
- DAPR: A Benchmark On Document-aware Passage Retrieval (2023) 5.18
- Improving Dense Passage Retrieval With Multiple Positive Passages (2025) 0.00
- Query-as-context Pre-training For Dense Passage Retrieval (2022) 7.68
- Cohort Retrieval Using Dense Passage Retrieval (2025) 0.00
- Towards Cross-modal Retrieval In Chinese Cultural Heritage Documents: Dataset And Solution (2025) 0.00
- Investigating Multi-layer Representations For Dense Passage Retrieval (2025) 0.00
- Augmenting Passage Representations With Query Generation For Enhanced Cross-lingual Dense Retrieval (2023) 8.14
- Synthetic Target Domain Supervision For Open Retrieval QA (2022) 4.52