ToolBench

Canonical

12 papers using it

2024 first seen

ToolBench is a dataset for evaluating LLM agents on tool-use tasks, with a focus on aspects such as correctness and error handling.

Papers using ToolBench (10)

Advancing Tool-Augmented Large Language Models via Meta-Verification and Reflection Learning · 2025 · 14 cites

CoHyDE: Iterative Co-Training of LLM Rewriter & Dense Encoder for Tool Retrieval · 2026

Adarubric: Task-adaptive Rubrics For LLM Agent Evaluation · 2026

How Many Tools Should an LLM Agent See? A Chance-Corrected Answer · 2026

Case-Based Calibration of Adaptive Reasoning and Execution for LLM Tool Use · 2026

Agenther: Hindsight Experience Replay For LLM Agent Trajectory Relabeling · 2026

Beyond Max Tokens: Stealthy Resource Amplification via Tool Calling Chains in LLM Agents · 2026

Think-Augmented Function Calling: Improving LLM Parameter Accuracy Through Embedded Reasoning · 2026

NaviAgent: Graph-Driven Bilevel Planning for Scalable Tool Orchestration · 2025

Toolplanner: A Tool Augmented LLM For Multi Granularity Instructions With Path Planning And Feedback · 2024 · 4 cites