Abstract
Existing agent benchmarks suffer from two critical limitations: high environment interaction overhead (up to 41% of total evaluation time) and imbalanced distributions of task horizon and difficulty, which make aggregate scores unreliable. To address these issues, we propose AgentCE-Bench, a benchmark built around a unified grid-based planning task in which agents must fill hidden slots in a partially completed schedule subject to both local slot constraints and global constraints. Our benchmark offers fine-grained control through two orthogonal axes: \textbf{Scalable Horizons}, controlled by the number of hidden slots, and \textbf{Controllable Difficulty}, governed by a decoy budget that determines the number of globally misleading decoy candidates. Crucially, under a \textbf{Lightweight Environment} design, all tool calls are resolved via static JSON files, eliminating setup overhead and enabling fast, reproducible evaluation suitable for training-time validation. We first validate that