DAPFAM: A Domain-aware Family-level Dataset To Benchmark Cross Domain Patent Retrieval
2025 · Iliass Ayaou, Denis Cavallucci, Hicham Chibane
Abstract
Patent prior-art retrieval becomes especially challenging when relevant disclosures cross technological boundaries. Existing benchmarks lack explicit domain partitions, making it difficult to assess how retrieval systems cope with such shifts. We introduce DAPFAM, a family-level benchmark with explicit IN-domain and OUT-domain partitions defined by a new IPC3 overlap scheme. The dataset contains 1,247 query families and 45,336 target families, aggregated at the family level to reduce international redundancy, with citation-based relevance judgments. We conduct 249 controlled experiments spanning lexical (BM25) and dense (transformer) backends, document-level and passage-level retrieval, multiple query and document representations, aggregation strategies, and hybrid fusion via Reciprocal Rank Fusion (RRF). Results reveal a pronounced domain gap: OUT-domain performance remains roughly five times lower than IN-domain performance across all configurations. Passage-level retrieval consistently outperforms doc
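The hybrid fusion step named in the abstract can be illustrated with a minimal sketch of Reciprocal Rank Fusion. This is not the authors' implementation: the function name, the family identifiers, and the two example rankings are hypothetical, and the smoothing constant k = 60 is the value commonly used in the RRF literature, not one confirmed by this abstract.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of IDs into one list using RRF.

    Each ID receives the sum of 1 / (k + rank) over every list it
    appears in (ranks start at 1). k=60 is an assumed default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical rankings of patent families from a lexical and a dense backend.
bm25_ranking = ["F1", "F2", "F3"]
dense_ranking = ["F3", "F1", "F4"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)  # ['F1', 'F3', 'F2', 'F4']
```

Because RRF uses only rank positions, it can combine BM25 scores and dense similarity scores without normalizing them to a common scale.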
Related papers
- DAPR: A Benchmark On Document-aware Passage Retrieval (2023) 5.18
- PARM: A Paragraph Aggregation Retrieval Model For Dense Document-to-document Retrieval (2022) 8.35
- Multi-CPR: A Multi Domain Chinese Dataset For Passage Retrieval (2022) 0.00
- Succeeding At Scale: Automated Dataset Construction And Query-side Adaptation For Multi-tenant Search (2026) 0.00
- A Large-scale Dataset And Benchmark For Similar Trademark Retrieval (2017) 0.00
- Prototype-based Semantic Consistency Alignment For Domain Adaptive Retrieval (2025) 0.00
- Probability Weighted Compact Feature For Domain Adaptive Retrieval (2020) 15.19
- MA-DPR: Manifold-aware Distance Metrics For Dense Passage Retrieval (2025) 0.00