Terminal-Bench 2.0
Emerging
12 papers using it
First seen: 2026
Terminal-Bench 2.0 is a benchmark for evaluating LLM-based agents on tasks carried out in a command-line interface. The papers listed here use it to measure the performance of terminal and coding agents, including self-evolving agents.
Papers using Terminal-Bench 2.0 (12)
- Socratic-SWE: Self-Evolving Coding Agents via Trace-Derived Agent Skills
- HarnessBridge: Learnable Bidirectional Controller for LLM Agent Harness
- LiteCoder-Terminal: Scaling Long-Horizon Terminal Environments for Learning Language Agents
- Terminal-World: Scaling Terminal-Agent Environments via Agent Skills
- ECHO: Terminal Agents Learn World Models for Free
- What Makes Interaction Trajectories Effective for Training Terminal Agents?
- APEX: Adaptive Principle EXtraction A Three-Layer Self-Evolution Framework for Production AI Agents
- SEAGym: An Evaluation Environment for Self-Evolving LLM Agents
- Frontier Coding Agents Use Metaprogramming to Adapt to Unfamiliar Programming Languages
- unix-ctf: Procedural Environments for Unix-Competence Reinforcement Learning
- Terminal-bench: Benchmarking Agents On Hard, Realistic Tasks In Command Line Interfaces
- SkillsVote: Lifecycle Governance of Agent Skills from Collection, Recommendation to Evolution