📊 Datasets — Awesome AI Agents

2,576 datasets & benchmarks — 18 canonical foundations plus emerging datasets mined from recent papers. Each links to the papers that use it.

ALFWorld [Canonical]

ALFWorld is a benchmark of interactive text-based environments used to evaluate reinforcement learning agents on long-horizon tasks with sparse rewards.

📄 77 papers

GAIA [Canonical]

📄 44 papers

SWE-bench [Canonical]

SWE-bench is a benchmark of 2,294 real GitHub issues drawn from 12 Python repositories, used to evaluate LLM-based software engineering agents. Papers indexed here also build larger run collections on top of it, such as 64,380 runs from 126 agent configurations across 43 frameworks, to compare behavioral differences and performance outcomes.

📄 43 papers

SWE-bench Verified [Canonical]

📄 40 papers

WebShop [Canonical]

WebShop is a simulated online shopping environment used to evaluate reinforcement learning algorithms on long-horizon agentic tasks.

LoCoMo

LoCoMo is a benchmark of very long, multi-session conversations used to evaluate the long-term conversational memory of large language model agents.

📄 30 papers

OSWorld [Canonical]

OSWorld is a benchmark of multi-application desktop environments used to evaluate how well computer-use agents (CUAs) interact with graphical desktops.

📄 28 papers

GSM8K [Emerging]

GSM8K is a benchmark of grade-school math word problems that require multi-step reasoning, used to evaluate the reasoning abilities of large language models.

WebArena

WebArena is a benchmark used to evaluate large language model web agents on their ability to carry out tasks through structured actions on realistic websites.

📄 19 papers

ScienceWorld [Canonical]

ScienceWorld is a benchmark dataset used to evaluate the performance of LLM agents in skill orchestration and execution within structured environments.

📄 18 papers

SMAC [Emerging]

📄 18 papers

AndroidWorld [Canonical]

AndroidWorld is a benchmark that provides a collection of Android applications and their user interfaces, used to evaluate mobile GUI agents.

AppWorld

AppWorld is a benchmark comprising a collection of applications and their associated environments, used to evaluate how well language agents understand and execute user instructions in complex contexts.

Tau2 Bench

Tau2 Bench is a benchmark of structured tasks used to evaluate tool-use agents, capturing interaction dynamics and the effectiveness of different strategies for tool invocation and environmental response.

📄 15 papers

BFCL [Emerging]

BFCL is a benchmark used to evaluate the tool-calling capabilities of large language models, focusing on their ability to retrieve and use demonstrations for specific tasks.

📄 14 papers

MMLU [Emerging]

📄 14 papers

LiveCodeBench [Emerging]

SWE-bench Lite [Emerging]

📄 13 papers

BFCLv3 [Emerging]

BFCLv3 is a benchmark used to evaluate tool-use agents; it represents interaction dynamics in a structured form through multi-trial rollouts of tasks.

Multi-Agent MuJoCo [Emerging]

📄 12 papers

StableToolBench [Emerging]

StableToolBench is a cost-augmented benchmark that evaluates the performance of budget-constrained tool-augmented agents in solving multi-step tasks while adhering to strict monetary budgets.

📄 12 papers

StarCraft II micromanagement benchmark [Emerging]

📄 12 papers

TAU-bench [Canonical]

📄 12 papers

Terminal-Bench [Emerging]

📄 12 papers

Terminal-Bench 2.0 [Emerging]

Terminal-Bench 2.0 is a benchmark dataset used to evaluate the performance and evolution of self-evolving LLM-based agents across various tasks and metrics.

📄 12 papers

ToolBench [Canonical]

ToolBench is a benchmark containing approximately 47,000 tools used to evaluate the performance of large language models in tool retrieval tasks through various query types and probing methods.

📄 12 papers

ALFRED [Emerging]

📄 11 papers

BrowseComp-ZH [Emerging]

📄 11 papers

Google Research Football [Emerging]

📄 11 papers

OpenClaw [Emerging]

OpenClaw is a dataset/benchmark used to evaluate the effectiveness of the Runtime Skill Audit (RSA) method by profiling and auditing the security of agent skills in dynamic runtime conditions.

StarCraft Multi-Agent Challenge (SMAC) [Emerging]

The StarCraft Multi-Agent Challenge (SMAC) is a benchmark that contains simulated environments used to evaluate the coordination, negotiation, and cooperation of autonomous artificial agents in multi-agent systems.

📄 10 papers

TravelPlanner [Emerging]

📄 10 papers

AgentBench [Canonical]

AgentBench is a multi-environment benchmark designed to evaluate the performance and failure modes of LLMs acting as agents in interactive settings.

📄 9 papers

AI2-THOR [Emerging]

AI2-THOR is a simulated environment that contains interactive 3D scenes used to evaluate the performance of AI systems in understanding and executing tasks involving complex object interactions.

MuSiQue

MuSiQue is a dataset of questions that require reasoning across multiple pieces of information, used to evaluate multi-hop question answering systems.

Berkeley Function Calling Leaderboard v3 [Emerging]

Berkeley Function Calling Leaderboard v3 is a benchmark of 200 tasks used to evaluate function-calling language agents with respect to their reasoning length and accuracy.

MetaWorld

MetaWorld is a benchmark of diverse robotic manipulation tasks used to evaluate the performance and robustness of world-model agents in continuous control settings.

📄 8 papers

MuJoCo [Emerging]

📄 8 papers