TerminalBench-2
Emerging · 5 papers using it
First seen: 2026
TerminalBench-2 is a dataset used to evaluate how well meta-agents manage and manipulate agentic execution states during complex tasks.
Papers using TerminalBench-2 (5)
- Dissecting model behavior through agent trajectories
- Sandboxed Coding Agents are Competitive Omni-modal Task Solvers
- Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses
- Shepherd: Enabling Programmable Meta-Agents via Reversible Agentic Execution Traces
- Automated Benchmark Auditing for AI Agents and Large Language Models