AgentBench
Canonical, 9 papers using it
First seen: 2023
AgentBench is a benchmark designed to evaluate the performance and failure modes of agentic AI systems operating continuously in production environments.
Papers using AgentBench (9)
- AgentBench: Evaluating LLMs as Agents
- Adapting the Interface, Not the Model: Runtime Harness Adaptation for Deterministic LLM Agents
- AstraFlow: Dataflow-Oriented Reinforcement Learning for Agentic LLMs
- What Do Agents Learn From Trajectory-SFT: Semantics Or Interfaces?
- SAGE-32B: Agentic Reasoning Via Iterative Distillation
- Evaluating Agentic AI In The Wild: Failure Modes, Drift Patterns, And A Production Evaluation Framework
- Understanding the Challenges in Iterative Generative Optimization with LLMs
- 6GAgentGym: Tool Use, Data Synthesis, and Agentic Learning for Network Management
- Enhancing the General Agent Capabilities of Low-Parameter LLMs through Tuning and Multi-Branch Reasoning