A Benchmark Construction and Evaluation Framework for Specialist Domains: Case Study on Defense-related Documents

Abstract

arXiv:2604.17943v2 Announce Type: replace Abstract: RAG-based question-answering (QA) in specialist domains faces a cold-start problem: lack of evaluative benchmarks and absence of labeled data for post-training. We present DoRA (Domain-oriented RAG Assessment), a novel benchmark construction and evaluation framework using only a small set of specialist domain documents. DoRA systematically generates synthetic QA training and evaluation datasets with auditable evidence across five domain-specific intents. To mitigate same-pipeline circularity, DoRA's training and test splits use different LLM families (Claude Sonnet for training; GPT-4o for test) drawn from disjoint seed-document corpora. Instantiated on 40 defense-related documents (written in English), DoRA yields ~6.6K curated instances. Compared against 8 LLM baselines over a benchmark of 1,259 samples, a LoRA-adapted Llama3.1-8B trained on the synthetic training set consistently improves performance over 6 coverage and faithfulness metrics, especially reducing hallucination by more than half under the default GTE retrieval setting, with gains persisting across alternative retrievers and prompting-based baselines. Defense-domain expertise is incorporated in three stages of our evaluation: (a) determining the quality of the synthetic QA generated by DoRA, (b) ascertaining the reliability of LLM-as-judge scores, and (c) evaluating the generalization of the QA pipeline on completely human-written QA examples. We position DoRA as a practical framework for specialist-domain RAG under domain shift, with defense as a high-stakes case study.

Abstract

Related papers