
AgentBench

Canonical

9 papers using it

2023 first seen

AgentBench is a benchmark for evaluating large language models (LLMs) as agents across a range of interactive environments.


Papers using AgentBench (9)

AgentBench: Evaluating LLMs as Agents · 2023 · 51 cites

Adapting the Interface, Not the Model: Runtime Harness Adaptation for Deterministic LLM Agents · 2026

AstraFlow: Dataflow-Oriented Reinforcement Learning for Agentic LLMs · 2026

What Do Agents Learn From Trajectory-SFT: Semantics Or Interfaces? · 2026

SAGE-32B: Agentic Reasoning Via Iterative Distillation · 2026

Evaluating Agentic AI In The Wild: Failure Modes, Drift Patterns, And A Production Evaluation Framework · 2026

Understanding the Challenges in Iterative Generative Optimization with LLMs · 2026

6GAgentGym: Tool Use, Data Synthesis, and Agentic Learning for Network Management · 2026

Enhancing the General Agent Capabilities of Low-Parameter LLMs through Tuning and Multi-Branch Reasoning · 2024
