MedicalAgentsBench for Complex Medical Reasoning: Comparing Internalized Reasoning Models versus Externalized Agent-based Frameworks

Yanjun Shao·Xiangru Tang·Jiwoong Sohn·Jiapeng Chen·Yuxuan Liao·Jiayi Zhang·Jinyu Xiang·Fang Wu·Yilun Zhao·Chenglin Wu·Wenqi Shi·Arman Cohan·Mark Gerstein·2025


Multi-Agent Code Agents Benchmarks Evaluation

Abstract

Complex medical reasoning requires integrating heterogeneous clinical evidence across multiple inference steps. Large language models (LLMs) now approach this through two routes: internalized reasoning and externalized agent scaffolding (frameworks that decompose problems collaboratively among multiple LLMs). To determine whether these routes are exclusive or complementary, we introduce MedicalAgentsBench, a filtered benchmark of 862 complex clinical questions drawn from the union of eight medical datasets via difficulty-aware curation and contamination screening. Evaluating three internalized reasoning models (DeepSeek-R1, o1-mini, and o3-mini), seven base models, and nine externalized agent-based methods, we find that internalized and externalized approaches each independently improve performance, and that their benefits compound: the highest accuracy, 35.1%, is achieved by layering an agent workflow onto an internalized reasoning model (o3-mini + MDAgents). Pareto analysis shows this combination dominates the cost-performance frontier; moreover, lightweight optimization on inexpensive models offers an entry point for resource-constrained settings. Our benchmark is at https://github.com/gersteinlab/MedicalAgentsBench.
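
The cost-performance Pareto analysis mentioned in the abstract can be illustrated with a minimal sketch. It assumes each evaluated configuration is summarized by a cost and an accuracy; the `Run` type, the `pareto_frontier` function, and the numbers below are hypothetical placeholders for illustration, not results or code from the paper.

```python
from typing import NamedTuple


class Run(NamedTuple):
    """One evaluated configuration (hypothetical structure, not from the paper)."""
    name: str
    cost: float      # e.g. dollars per question; lower is better
    accuracy: float  # percent correct; higher is better


def pareto_frontier(runs: list[Run]) -> list[Run]:
    """Return the runs that no other run beats on both cost and accuracy."""
    frontier: list[Run] = []
    best_accuracy = float("-inf")
    # Visit runs from cheapest to most expensive; at equal cost, try the
    # more accurate run first so the weaker one is filtered out.
    for run in sorted(runs, key=lambda r: (r.cost, -r.accuracy)):
        # A run stays only if it is strictly more accurate than every
        # run that costs the same or less.
        if run.accuracy > best_accuracy:
            frontier.append(run)
            best_accuracy = run.accuracy
    return frontier


# Placeholder values chosen only to exercise the function.
example_runs = [
    Run("method_a", cost=0.5, accuracy=20.0),
    Run("method_b", cost=1.0, accuracy=18.0),
    Run("method_c", cost=2.0, accuracy=30.0),
    Run("method_d", cost=2.0, accuracy=25.0),
]

for run in pareto_frontier(example_runs):
    print(run.name, run.cost, run.accuracy)
```

In this toy input, `method_b` is dropped because `method_a` is both cheaper and more accurate, and `method_d` is dropped because `method_c` reaches higher accuracy at the same cost.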

Code

gersteinlab/MedicalAgentsBench
