TAU-bench

Canonical

11 papers using it

First seen: 2025

TAU-bench is a benchmark of tasks for evaluating how well AI agents handle multi-turn, multi-step interactions that require calling external tools.

Papers using TAU-bench (11)

Self-Challenging Language Model Agents · 2025 · 1 cite

SAMULE: Self-Learning Agents Enhanced by Multi-level Reflection · 2025 · 6 cites

Skill or Skip? Learning Selective Skill Invocation in Agentic Tasks via Dual-Granularity Preference Learning · 2026

RealUserSim: Bridging the Reality Gap in Agent Benchmarking via Grounded User Simulation · 2026

MAVEN: Improving Generalization in Agentic Tool Calling · 2026

Robust Tool Use via Fission-GRPO: Learning to Recover from Execution Errors · 2026

CM2: Reinforcement Learning with Checklist Rewards for Multi-Turn and Multi-Step Agentic Tool Use · 2026

Prompting Policies for Multi-step Reasoning and Tool-Use in Black-box LLMs with Iterative Distillation of Experience · 2026

Multi-Turn Reinforcement Learning for Tool-Calling Agents with Iterative Reward Calibration · 2026

On Generalization in Agentic Tool Calling: CoreThink Agentic Reasoner and MAVEN Dataset · 2025

Establishing Best Practices For Building Rigorous Agentic Benchmarks · 2025