TerminalBench

Emerging

3 papers using it

First seen: 2026

TerminalBench is a dataset used to evaluate how well monitors predict failures in large language model (LLM) agent tasks, based on terminal outcomes.

Papers using TerminalBench (3)

Combee: Scaling Prompt Learning for Self-Improving Language Model Agents (2026)

PrefixGuard: From LLM-Agent Traces to Online Failure-Warning Monitors (2026)

Agents Explore but Agents Ignore: LLMs Lack Environmental Curiosity (2026)