tau-2-bench
Emerging · 15 papers using it
First seen: 2025
tau-2-bench (also written "Tau2 Bench") is a benchmark dataset for evaluating tool-use agents. It provides structured tasks that capture the dynamics of agent interaction, including how effective different tool-invocation strategies are and how the environment responds to them.
Papers using tau-2-bench (15)
- Towards General Agentic Intelligence via Environment Scaling
- From Confident Closing to Silent Failure: Characterizing False Success in LLM Agents
- Agent Planning Benchmark: A Diagnostic Framework for Planning Capabilities in LLM Agents
- Proper Scoring Rules for Agentic Uncertainty Quantification
- SkillsInjector: Dynamic Skill Context Construction for LLM Agents
- MemGym: a Long-Horizon Memory Environment for LLM Agents
- MAVEN: Improving Generalization in Agentic Tool Calling
- Robust Tool Use via Fission-GRPO: Learning to Recover from Execution Errors
- TopoCurate: Modeling Interaction Topology for Tool-Use Agent Training
- EnvFactory: Scaling Tool-Use Agents via Executable Environments Synthesis and Robust RL
- Step 3.5 Flash: Open Frontier-level Intelligence With 11B Active Parameters
- Toward Scalable Verifiable Reward: Proxy State-Based Evaluation for Multi-turn Tool-Calling LLM Agents
- AutoForge: Automated Environment Synthesis for Agentic Reinforcement Learning
- On Generalization in Agentic Tool Calling: CoreThink Agentic Reasoner and MAVEN Dataset
- Toolorchestra: Elevating Intelligence Via Efficient Model And Tool Orchestration