Terminal-Bench

Emerging

15 papers using it

First seen: 2025

Terminal-Bench Dataset

This dataset contains tasks from Terminal-Bench, a benchmark for evaluating AI agents in real terminal environments. Each task is packaged as a complete, self-contained archive that preserves the exact directory structure, binary files, Docker configurations, and test scripts needed for faithful…


Papers using Terminal-Bench (9)

Qwen3-Coder-Next Technical Report (2026)

CLI-Gym: Scalable CLI Task Generation via Agentic Environment Inversion (2026)

ADK Arena: Evaluating Agent Development Kits via LLM-as-a-Developer (2026)

CRANE: Constrained Reasoning Injection for Code Agents via Nullspace Editing (2026)

R2V Agent: Teaching SLMs When to Ask for Help (2026)

From Translation to Superset: Benchmark-Driven Evolution of a Production AI Agent from Rust to Python (2026)

Toward Scalable Terminal Task Synthesis via Skill Graphs (2026)

AOrchestra: Automating Sub-Agent Creation for Agentic Orchestration (2026)

SkyRL-Agent: Efficient RL Training for Multi-turn LLM Agent (2025)