TerminalBench
Emerging
3 papers using it
First seen: 2026
TerminalBench is a dataset used to evaluate the performance of monitors in predicting failures in large language model (LLM) agent tasks based on terminal outcomes.