Terminal-Bench 2.0
Emerging
3 papers using it
First seen: 2026
Terminal-Bench 2.0 is a benchmark dataset for evaluating large language model agents on long-horizon tasks by assessing how they interact with various harnesses.