Datasets - Awesome AI Agents
2,576 datasets & benchmarks: 18 canonical foundations plus emerging datasets mined from recent papers. Each links to the papers that use it.
ALFWorld is a benchmark of interactive text-based household environments, aligned with the embodied ALFRED tasks, designed to evaluate reinforcement learning agents on long-horizon tasks with sparse rewards.
SWE-bench is a benchmark of real GitHub issues from open-source Python repositories, used to evaluate LLM-based software engineering agents. The collection indexed here comprises 64,380 runs from 126 agent configurations across 43 frameworks, used to compare the behavior and performance of those agents.
WebShop is a benchmark built around a simulated online shopping environment, used to evaluate reinforcement learning algorithms on long-horizon agentic tasks.
LoCoMo is a benchmark used to evaluate the long-term conversational memory of large language model agents, focusing on their ability to recall and reason over information from very long, multi-session dialogues.
OSWorld is a benchmark of tasks set in real, multi-application desktop environments, used to evaluate how well computer-use agents (CUAs) interact with graphical desktops.
GSM8K is a benchmark of roughly 8,500 grade-school math word problems that require multi-step arithmetic reasoning, used to evaluate the reasoning abilities of large language models.
WebArena is a benchmark of realistic, self-hosted websites used to evaluate large language model web agents by measuring their ability to complete tasks through structured actions on those sites.
ScienceWorld is a text-based benchmark of elementary-science tasks used to evaluate LLM agents on skill orchestration and execution within structured environments.
AndroidWorld is a benchmark used to evaluate mobile GUI agents on tasks carried out in a collection of Android applications and their user interfaces.
AppWorld is a benchmark containing a collection of applications and their associated environments, used to evaluate how well language agents understand and execute user instructions in complex contexts.
Tau2 Bench is a benchmark used to evaluate tool-use agents through structured tasks that capture interaction dynamics and the effectiveness of different strategies for tool invocation and environmental response.
BFCL (the Berkeley Function Calling Leaderboard) is a benchmark used to evaluate the tool-calling capabilities of large language models, here with a focus on their ability to retrieve and use demonstrations for specific tasks.
BFCLv3 is a benchmark used to evaluate tool-use agents; it provides a structured representation of interaction dynamics through multi-trial rollouts of tasks.
StableToolBench is a tool-use benchmark; in the cost-augmented form described here, it evaluates budget-constrained tool-augmented agents on multi-step tasks that must be solved within strict monetary budgets.
Terminal-Bench 2.0 is a benchmark of command-line tasks used to evaluate the performance and evolution of self-evolving LLM-based agents across various tasks and metrics.
ToolBench is a benchmark containing approximately 47,000 tools used to evaluate the performance of large language models in tool retrieval tasks through various query types and probing methods.
OpenClaw is a benchmark used to evaluate the effectiveness of the Runtime Skill Audit (RSA) method, which profiles and audits the security of agent skills under dynamic runtime conditions.
The StarCraft Multi-Agent Challenge (SMAC) is a benchmark of StarCraft II micromanagement scenarios used to evaluate coordination and cooperation among autonomous agents in multi-agent reinforcement learning.
AgentBench is a multi-environment benchmark designed to evaluate the performance and failure modes of large language models acting as agents.
AI2-THOR is a simulated environment that contains interactive 3D scenes used to evaluate the performance of AI systems in understanding and executing tasks involving complex object interactions.
MuSiQue is a dataset used to evaluate multi-hop question answering systems by providing a collection of questions that require reasoning across multiple pieces of information.
The Berkeley Function Calling Leaderboard v3 is a benchmark containing 200 tasks, used to evaluate how the accuracy of function-calling language agents relates to their reasoning length.
MetaWorld is a benchmark that contains a collection of diverse robotic manipulation tasks used to evaluate the performance and robustness of world-model agents in continuous control settings.