Terminal-Bench 2.0

Emerging

12 papers using it

2026 first seen

Terminal-Bench 2.0 is a benchmark dataset of hard, realistic command-line tasks used to evaluate LLM-based agents, including self-evolving agents.

Papers using Terminal-Bench 2.0 (12)

Socratic-SWE: Self-Evolving Coding Agents via Trace-Derived Agent Skills (2026)

HarnessBridge: Learnable Bidirectional Controller for LLM Agent Harness (2026)

LiteCoder-Terminal: Scaling Long-Horizon Terminal Environments for Learning Language Agents (2026)

Terminal-World: Scaling Terminal-Agent Environments via Agent Skills (2026)

ECHO: Terminal Agents Learn World Models for Free (2026)

What Makes Interaction Trajectories Effective for Training Terminal Agents? (2026)

APEX: Adaptive Principle EXtraction A Three-Layer Self-Evolution Framework for Production AI Agents (2026)

SEAGym: An Evaluation Environment for Self-Evolving LLM Agents (2026)

Frontier Coding Agents Use Metaprogramming to Adapt to Unfamiliar Programming Languages (2026)

unix-ctf: Procedural Environments for Unix-Competence Reinforcement Learning (2026)

Terminal-bench: Benchmarking Agents On Hard, Realistic Tasks In Command Line Interfaces (2026)

SkillsVote: Lifecycle Governance of Agent Skills from Collection, Recommendation to Evolution (2026)
