Teaching large language models to reason like expert diagnosticians

Abstract

arXiv:2509.12194v2 Announce Type: replace Abstract: Differential diagnosis is an iterative process that integrates patient information with broader medical knowledge. Clinical case series such as the NEJM Clinicopathologic Conferences (CPCs), published continuously since 1923, feature expert physicians who demonstrate diagnostic reasoning to peers, and have been used for decades to evaluate AI. However, prior AI evaluations have largely focused on final diagnostic accuracy rather than nuanced clinical reasoning. Here, we introduce Dr. CaBot, an agentic AI system that emulates an expert diagnostician by generating written and narrated slide-based presentations from an initial case description alone. CaBot recently generated the first AI diagnosis published in the 100+ year history of the NEJM CPCs. In blinded evaluations, physicians misclassified the source of the differential (CaBot vs. physician-written) in 46/62 (74%) of trials and rated them favorably across quality dimensions. When tasked with solving cases for 72 patients with undiagnosed disease from the NIH Undiagnosed Diseases Network, CaBot identified the working diagnosis in 50/72 (69%) of cases from referral notes alone. To promote transparency and research, we also developed CPC-Bench, a physician-validated benchmark based on 7,102 CPCs and 47,648 questions across 10 tasks. We show that CaBot outperforms frontier models on CPC-Bench, and release both CaBot and CPC-Bench publicly to foster progress in clinical AI.

Abstract

Related papers