📊 Datasets - Awesome AI Agents

2,576 datasets & benchmarks: 18 canonical foundations plus emerging datasets mined from recent papers. Each links to the papers that use it.

ALFWorld [Canonical]

ALFWorld is a benchmark of interactive, text-based household environments aligned with the embodied ALFRED tasks, used to evaluate agents, including reinforcement learning agents, on long-horizon tasks with sparse rewards.

📄 77 papers
GAIA [Canonical]
📄 44 papers
SWE-bench [Canonical]

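SWE-bench is a benchmark of 2,294 real GitHub issues and their fixing pull requests from 12 open-source Python repositories, used to evaluate whether LLM-based software engineering agents can resolve them. Papers in this collection also use it to compare agent behavior and outcomes, for example in an analysis of 64,380 runs from 126 agent configurations across 43 frameworks.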

📄 43 papers
SWE-bench Verified [Canonical]
📄 40 papers
WebShop [Canonical]

WebShop is a benchmark that provides a simulated online shopping environment, used to evaluate reinforcement learning algorithms on long-horizon agentic tasks.

📄 39 papers
HotpotQA [Emerging]
📄 31 papers
LoCoMo [Emerging]

LoCoMo is a benchmark of very long, multi-session conversations used to evaluate the long-term conversational memory of large language model agents, for example through question answering over the dialogue history.

📄 30 papers
OSWorld [Canonical]

OSWorld is a benchmark that contains configurations for multi-application environments used to evaluate the performance of computer-use agents (CUAs) in interacting with graphical desktops.

📄 28 papers
GSM8K [Emerging]

GSM8K is a benchmark of roughly 8,500 grade-school math word problems, used to evaluate the multi-step reasoning abilities of large language models.

📄 25 papers
HumanEval [Emerging]
📄 25 papers
LIBERO [Emerging]
📄 24 papers
LongMemEval [Emerging]
📄 20 papers
WebArena [Canonical]

WebArena is a benchmark used to evaluate large language model web agents by measuring their ability to complete tasks on realistic websites through structured tool actions.

📄 19 papers
ScienceWorld [Canonical]

ScienceWorld is a text-based benchmark environment built around elementary science tasks, used to evaluate LLM agents on skill orchestration and execution within structured environments.

📄 18 papers
SMAC [Emerging]
📄 18 papers
AndroidWorld [Canonical]

AndroidWorld is a benchmark used to evaluate mobile GUI agents on tasks across a collection of Android applications and their user interfaces.

📄 17 papers
BrowseComp [Emerging]
📄 16 papers
StarCraft II [Emerging]
📄 16 papers
AppWorld [Emerging]

AppWorld is a benchmark that contains a collection of applications and their associated environments, used to evaluate how well language agents understand and execute user instructions in complex contexts.

📄 15 papers
SkillsBench [Emerging]
📄 15 papers
tau-2-bench [Emerging]

tau-2-bench is a benchmark used to evaluate tool-use agents on structured tasks that capture interaction dynamics and the effectiveness of different strategies for tool invocation and environmental response.

📄 15 papers
BFCL [Emerging]

BFCL (Berkeley Function Calling Leaderboard) is a benchmark used to evaluate the tool-calling capabilities of large language models, focusing on their ability to retrieve and utilize demonstrations for specific tasks.

📄 14 papers
MMLU [Emerging]
📄 14 papers
LiveCodeBench [Emerging]
📄 13 papers
MATH [Emerging]
📄 13 papers
MiniGrid [Emerging]
📄 13 papers
SWE-bench Lite [Emerging]
📄 13 papers
BFCLv-3 [Emerging]

BFCLv3 is a benchmark used to evaluate tool-use agents by providing a structured representation of interaction dynamics through multi-trial rollouts of tasks.

📄 12 papers
Minecraft [Emerging]
📄 12 papers
Moltbook [Emerging]
📄 12 papers
Multi-Agent MuJoCo [Emerging]
📄 12 papers
StableToolBench [Emerging]

StableToolBench is a cost-augmented benchmark that evaluates the performance of budget-constrained tool-augmented agents in solving multi-step tasks while adhering to strict monetary budgets.

📄 12 papers
StarCraft II micromanagement benchmark [Emerging]
📄 12 papers
TAU-bench [Canonical]
📄 12 papers
Terminal-Bench [Emerging]
📄 12 papers
Terminal-Bench 2.0 [Emerging]

Terminal-Bench 2.0 is a benchmark of tasks carried out in a terminal (command-line) environment, used to evaluate the performance and evolution of self-evolving LLM-based agents.

📄 12 papers
ToolBench [Canonical]

ToolBench is a benchmark containing approximately 47,000 tools used to evaluate the performance of large language models in tool retrieval tasks through various query types and probing methods.

📄 12 papers
ALFRED [Emerging]
📄 11 papers
BrowseComp-ZH [Emerging]
📄 11 papers
Google Research Football [Emerging]
📄 11 papers
OpenClaw [Emerging]

OpenClaw is a benchmark used to evaluate the effectiveness of the Runtime Skill Audit (RSA) method, which profiles and audits the security of agent skills under dynamic runtime conditions.

📄 11 papers
VirtualHome [Emerging]
📄 11 papers
AIME 2024 [Emerging]
📄 10 papers
GPQA [Emerging]
📄 10 papers
MBPP [Emerging]
📄 10 papers
StarCraft Multi-Agent Challenge (SMAC) [Emerging]

The StarCraft Multi-Agent Challenge (SMAC) is a benchmark of StarCraft II micromanagement scenarios, used to evaluate coordination and cooperation among autonomous agents in multi-agent systems.

📄 10 papers
TravelPlanner [Emerging]
📄 10 papers
AgentBench [Canonical]

AgentBench is a benchmark that evaluates LLMs acting as agents across multiple interactive environments, such as operating systems, databases, and web tasks, and is used to study the performance and failure modes of agentic AI systems.

📄 9 papers
AI2-THOR [Emerging]

AI2-THOR is a simulated environment that contains interactive 3D scenes used to evaluate the performance of AI systems in understanding and executing tasks involving complex object interactions.

📄 9 papers
MMLU-Pro [Emerging]
📄 9 papers
MPE [Emerging]
📄 9 papers
MuSiQue [Emerging]

MuSiQue is a dataset used to evaluate multi-hop question answering systems by providing a collection of questions that require reasoning across multiple pieces of information.

📄 9 papers
Overcooked [Emerging]
📄 9 papers
τ-Bench [Emerging]
📄 9 papers
Berkeley Function Calling Leaderboard v-3 [Emerging]

The Berkeley Function Calling Leaderboard v3 is a benchmark that contains 200 tasks, used to evaluate how the performance of function-calling language agents relates to their reasoning length and accuracy.

📄 8 papers
Hanabi [Emerging]
📄 8 papers
Kaggle [Emerging]
📄 8 papers
MATH-500 [Emerging]
📄 8 papers
MetaWorld [Emerging]

MetaWorld is a benchmark that contains a collection of diverse robotic manipulation tasks used to evaluate the performance and robustness of world-model agents in continuous control settings.

📄 8 papers
MuJoCo [Emerging]
📄 8 papers