Why Your Deep Research Agent Fails? On Hallucination Evaluation in Full Research Trajectory

Abstract

arXiv:2601.22984v2

Diagnosing failure patterns in Deep Research Agents (DRAs) remains a critical challenge. Existing benchmarks predominantly rely on end-to-end evaluation, which obscures intermediate hallucinations that accumulate throughout the research trajectory. To bridge this gap, we propose a shift from outcome-based to process-aware evaluation by auditing hallucinations across the full plan-search-summarize trajectory. We introduce the PING Taxonomy, which categorizes DRA hallucinations into four complementary types: Propagation, Intent, Noise-induced, and Grounding. We further instantiate this taxonomy as a fine-grained evaluation framework that decomposes trajectories into atomic actions, claims, and sub-queries for rigorous verification. Leveraging this framework to isolate 100 distinctively hallucination-prone tasks, including adversarial scenarios, we curate DeepHalluBench. Experiments on six representative DRAs show that, on our hallucination-prone stress-test set, all evaluated systems still exhibit non-negligible reliability gaps. Furthermore, our diagnostic analysis traces these failures to systemic deficits, especially hallucination propagation and cognitive biases, providing actionable insights for future architectural optimization. Code and data are available at https://github.com/yuhao-zhan/DeepHalluBench.
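To make the process-aware idea concrete, the following is a minimal illustrative sketch, not the DeepHalluBench implementation. It assumes a trajectory has already been decomposed into atomic units (actions, claims, sub-queries) and that some external verifier has attached an optional PING label to each unit. The names `PingType`, `Unit`, `tally`, and `hallucination_rate` are hypothetical and do not come from the paper or its repository.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class PingType(Enum):
    """The four PING categories named in the abstract (labels only)."""
    PROPAGATION = "Propagation"
    INTENT = "Intent"
    NOISE_INDUCED = "Noise-induced"
    GROUNDING = "Grounding"


@dataclass
class Unit:
    """One atomic unit of a trajectory (hypothetical schema)."""
    stage: str                        # "plan", "search", or "summarize"
    kind: str                         # "action", "claim", or "sub_query"
    text: str
    label: Optional[PingType] = None  # None means no hallucination was flagged


def tally(units):
    """Count flagged units per PING type, including types with zero hits."""
    counts = Counter(u.label for u in units if u.label is not None)
    return {t: counts.get(t, 0) for t in PingType}


def hallucination_rate(units):
    """Fraction of units flagged with any PING type (0.0 for an empty list)."""
    if not units:
        return 0.0
    return sum(u.label is not None for u in units) / len(units)


# Toy trajectory with made-up labels, purely for illustration.
trajectory = [
    Unit("plan", "sub_query", "find recent surveys", None),
    Unit("search", "action", "open result 3", PingType.NOISE_INDUCED),
    Unit("summarize", "claim", "survey reports X", PingType.GROUNDING),
    Unit("summarize", "claim", "therefore Y", PingType.PROPAGATION),
]
print(tally(trajectory))
print(hallucination_rate(trajectory))
```

Scoring each unit rather than only the final report is what lets an audit of this kind locate where in the plan-search-summarize trajectory an error first appears; how the labels are actually assigned is left to the framework described in the paper.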
