📊 Datasets — Awesome AI Agents

1,845 datasets & benchmarks — 18 canonical foundations plus emerging datasets mined from recent papers. Each links to the papers that use it.

ALFWorld · Canonical

ALFWorld is a dataset and benchmark designed to evaluate the ability of language agents to ground their actions in contextual information from their environment.

📄 88 papers

GAIA · Canonical

GAIA is a benchmark that aims to evaluate next-generation LLMs (LLMs with augmented capabilities due to added tooling, efficient prompting, access to search, etc.). We added gating to prevent bots from scraping the dataset. Please do not reshare the validation or test set in a crawlable format. Data and leaderboard: GAIA is made of more than 450 non-trivial questions with unambiguous answers, requiring different levels of tooling and autonomy to… See the full description on the dataset page: https://huggingface.co/datasets/gaia-benchmark/GAIA.

📄 48 papers

WebShop · Canonical

The 'WebShop' dataset is a benchmark used to evaluate the performance of AI agents in a shopping environment, focusing on their ability to interact with and navigate through various tasks related to online shopping.

📄 45 papers

SWE-bench Verified · Canonical

A human-validated subset of SWE-bench with confirmed-solvable, well-specified issues.

📄 35 papers

SWE-bench · Canonical

Real GitHub issues paired with their fix PRs across Python repositories; models must produce patches that pass the repo's tests.

📄 34 papers
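
The pass/fail criterion in the SWE-bench entry can be illustrated with a minimal sketch. It assumes per-test outcomes are already available as a dictionary of test name to status string. The names `fail_to_pass` and `pass_to_pass` mirror the dataset's FAIL_TO_PASS and PASS_TO_PASS fields, while `is_resolved` and the status strings are hypothetical and not part of the official evaluation harness.

```python
from typing import Dict, Iterable


def is_resolved(test_status: Dict[str, str],
                fail_to_pass: Iterable[str],
                pass_to_pass: Iterable[str]) -> bool:
    """Return True when every required test passed after applying the patch.

    fail_to_pass: tests that failed before the fix and must now pass.
    pass_to_pass: tests that passed before the fix and must keep passing.
    A test missing from test_status counts as not passed.
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    return all(test_status.get(name) == "PASSED" for name in required)


# Hypothetical post-patch results for one task instance.
status = {"test_fix": "PASSED", "test_existing": "PASSED", "test_other": "FAILED"}
print(is_resolved(status, ["test_fix"], ["test_existing"]))  # True
print(is_resolved(status, ["test_fix"], ["test_other"]))     # False
```

Treating a missing test as a failure is a deliberately strict choice here: a patch that prevents a test from running at all should not count as a fix.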

LoCoMo · Emerging

The 'LoCoMo' dataset/benchmark is used to evaluate the long-term conversation capabilities of personalized AI agents by providing a structured framework for assessing their memory and reasoning abilities over extended interactions.

📄 33 papers

HotpotQA · Emerging

Dataset Card for BEIR Benchmark. hotpotqa is one of the datasets from the Question Answering task within BEIR, measuring Wikipedia article retrieval for a given multi-hop query. Dataset Summary: BEIR is a heterogeneous benchmark built from 18 diverse datasets representing 9 information retrieval tasks. Fact-checking: FEVER, Climate-FEVER, SciFact. Question-Answering: NQ, HotpotQA, FiQA-2018. Bio-Medical IR: TREC-COVID, BioASQ, NFCorpus. News Retrieval: TREC-NEWS, Robust04… See the full description on the dataset page: https://huggingface.co/datasets/BeIR/hotpotqa.

📄 29 papers

OSWorld · Canonical

The 'OSWorld' dataset is a benchmark used to evaluate GUI automation tasks and tool-calling tasks for open-source models.

📄 22 papers

LongMemEval · Emerging

⚠️ This dataset is deprecated. It is replaced by longmemeval-cleaned (https://huggingface.co/datasets/xiaowu0162/longmemeval-cleaned), which removes noisy history sessions that interfere with answer correctness.

📄 21 papers · ⬇ 3.0k · 💛 22 · 🤗 HF · mit

ScienceWorld · Canonical

ScienceWorld is a benchmark dataset used to evaluate the performance of LLM agents in skill orchestration and execution within structured environments.

📄 21 papers

WebArena · Canonical

WebArena is a benchmark dataset used to evaluate the performance of large language model web agents by measuring their ability to execute structured tool actions based on web interactions.

📄 20 papers

BrowseComp · Emerging

BrowseComp is a benchmark that contains complex questions synthesized from live web traversal, used to evaluate the browsing competence of search agents by minimizing test-set contamination and parametric memorization.

📄 17 papers

tau-2-bench · Emerging

'Tau-2 Bench' is a dataset used to evaluate the performance of tool-use agents by providing a structured set of tasks that assess interaction dynamics and the effectiveness of various training strategies.

📄 17 papers

BFCL · Emerging

The BFCL dataset/benchmark contains instances of tool-selection failures in LLM agents and is used to evaluate the attention mechanisms and decision-making processes of these models in selecting the correct tools.

📄 16 papers

AppWorld · Emerging

The 'AppWorld' dataset/benchmark contains a collection of applications and their associated contexts, used to evaluate the ability of language agents to ground user instructions in the relevant environmental information.

📄 15 papers

SMAC · Emerging

The 'SMAC' dataset/benchmark is used to evaluate multi-agent reinforcement learning algorithms by providing a set of scenarios that require coordination among agents in competitive and cooperative settings.

📄 15 papers

GSM8K · Emerging

The 'GSM8K' dataset is a benchmark of roughly 8,500 grade school math word problems used to evaluate the reasoning capabilities of language models.

📄 14 papers

HumanEval · Emerging

HumanEval-X is a benchmark for the evaluation of the multilingual ability of code generative models. It consists of 820 high-quality human-crafted data samples (each with test cases) in Python, C++, Java, JavaScript, and Go, and can be used for various tasks.

📄 13 papers · ⬇ 3.9k · 💛 96 · 🤗 HF · apache-2.0

BFCLv-3 · Emerging

BFCLv-3 is a benchmark dataset used to evaluate the effectiveness of large language model agents in generating and refining tool calls through simulated feedback.

📄 13 papers

SkillsBench · Emerging

Warning: The leaderboard above is generated by Hugging Face eval-results and may be incomplete until evaluation_framework: benchflow is accepted and deployed. The audited SkillsBench v1.1 result archive is https://huggingface.co/datasets/benchflow/skillsbench-leaderboard, with all retained submissions normalized under submissions/skillsbench/v1.1/ and compact official exports under leaderboard/skillsbench/v1.1/. Warning: The dataset is a read-only mirror. The primary source for this benchmark… See the full description on the dataset page: https://huggingface.co/datasets/benchflow/skillsbench.

📄 13 papers

BrowseComp-ZH · Emerging

🧭 BrowseComp-ZH: Benchmarking the Web Browsing Ability of Large Language Models in Chinese. BrowseComp-ZH is the first high-difficulty benchmark specifically designed to evaluate the real-world web browsing and reasoning capabilities of large language models (LLMs) in the Chinese information ecosystem. Inspired by BrowseComp (Wei et al., 2025), BrowseComp-ZH targets the unique linguistic, structural, and retrieval challenges of the Chinese web, including fragmented platforms… See the full description on the dataset page: https://huggingface.co/datasets/PALIN2018/BrowseComp-ZH.

📄 12 papers · ⬇ 1.1k · 💛 7 · 🤗 HF · apache-2.0

AndroidWorld · Canonical

AndroidWorld is a dataset/benchmark used to evaluate the performance of mobile GUI agents by providing a diverse set of tasks and associated visual contexts from Android applications.

📄 12 papers

StableToolBench · Emerging

StableToolBench is a cost-augmented benchmark that evaluates the performance of budget-constrained tool-augmented agents in solving multi-step tasks while adhering to strict monetary budgets.

📄 12 papers

StarCraft II · Emerging

'StarCraft II' is a complex real-time strategy game used as a benchmark to evaluate cooperative multi-agent reinforcement learning (MARL) algorithms in environments with large joint observation and action spaces.

📄 12 papers

StarCraft Multi-Agent Challenge (SMAC) · Emerging

The StarCraft Multi-Agent Challenge (SMAC) is a benchmark that provides simulated environments for evaluating the coordination, negotiation, and cooperation of autonomous artificial agents in multi-agent systems.

📄 12 papers

Minecraft · Emerging

Dataset Card for "Minecrafter". More information needed.

📄 11 papers

SWE-Bench Lite · Emerging

Dataset Summary: SWE-bench Lite is a subset of SWE-bench, a dataset that tests systems’ ability to solve GitHub issues automatically. The dataset collects 300 test Issue-Pull Request pairs from 11 popular Python repositories. Evaluation is performed by unit test verification using post-PR behavior as the reference solution. The dataset was released as part of "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" Want to run inference now? This dataset only contains the… See the full description on the dataset page: https://huggingface.co/datasets/princeton-nlp/SWE-bench_Lite.

📄 11 papers

TAU-bench · Canonical

The 'TAU-bench' dataset/benchmark contains a collection of tasks designed to evaluate AI agents' performance in multi-turn and multi-step interactions involving the use of external tools.

📄 11 papers

Terminal-Bench 2.0 · Emerging

Warning: The leaderboard above is unofficial. The official leaderboard is https://www.tbench.ai/leaderboard/terminal-bench/2.0, in which entries are audited for correct configuration, results show which agent harness is used, and verified trajectories are publicly viewable. Warning: The dataset is a read-only mirror. The primary source for this dataset is on GitHub: https://github.com/harbor-framework/terminal-bench-2. Please open issues and pull requests there. How this mirror was created… See the full description on the dataset page: https://huggingface.co/datasets/harborframework/terminal-bench-2.0.

📄 11 papers

AgentBench · Canonical

AgentBench is a benchmark designed to evaluate the performance and failure modes of agentic AI systems operating continuously in production environments.

📄 10 papers

GPQA · Emerging

Dataset Card for GPQA. GPQA is a multiple-choice Q&A dataset of very hard questions written and validated by experts in biology, physics, and chemistry. When attempting questions out of their own domain (e.g., a physicist answers a chemistry question), these experts get only 34% accuracy, despite spending more than 30 minutes with full access to Google. We request that you do not reveal examples from this dataset in plain text or images online, to reduce the risk of leakage into foundation model… See the full description on the dataset page: https://huggingface.co/datasets/Idavidrein/gpqa.

📄 10 papers

MATH · Emerging

The 'MATH' dataset is a benchmark that contains mathematical problems used to evaluate the performance of large language models in solving complex reasoning tasks.

📄 10 papers

StarCraft II micromanagement benchmark · Emerging

The 'StarCraft II micromanagement benchmark' is a dataset used to evaluate the performance of multi-agent reinforcement learning algorithms in complex, real-time strategy scenarios involving unit control and decision-making.

📄 10 papers

ToolBench · Canonical

ToolBench is a dataset used to evaluate the performance of agents on various programming tasks, focusing on aspects such as correctness and error handling.

📄 10 papers

Berkeley Function Calling Leaderboard v-3 · Emerging

The 'Berkeley Function Calling Leaderboard v-3' is a benchmark dataset that contains 200 tasks used to evaluate the performance of function-calling language agents, particularly in relation to the effects of chain-of-thought reasoning on their accuracy.

📄 9 papers

BrowseComp-Plus · Emerging

BrowseComp-Plus is a new benchmark for Deep-Research systems, isolating the effect of the retriever and the LLM agent to enable fair, transparent comparisons of Deep-Research agents. The benchmark sources challenging, reasoning-intensive queries from OpenAI's BrowseComp. However, instead of searching the live web, BrowseComp-Plus evaluates against a fixed, curated corpus of ~100K web documents. The corpus includes both human-verified evidence documents… See the full description on the dataset page: https://huggingface.co/datasets/Tevatron/browsecomp-plus.

📄 9 papers

Google Research Football · Emerging

'Google Research Football' is a benchmark that contains a simulated football environment used to evaluate cooperative Multi-Agent Reinforcement Learning (MARL) algorithms.

📄 9 papers

HLE · Emerging

IMPORTANT: Please help us protect the integrity of this benchmark by not publicly sharing, re-uploading, or distributing the dataset. Humanity's Last Exam, by the Center for AI Safety & Scale AI. Humanity's Last Exam (HLE) is a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. Humanity's Last Exam consists of 2,500 questions across dozens of… See the full description on the dataset page: https://huggingface.co/datasets/cais/hle.

📄 9 papers

OpenClaw · Emerging

OpenClaw is a community that provides real requests used to define tasks for evaluating computer-use agents in personal assistant scenarios.

📄 9 papers

TravelPlanner · Emerging

TravelPlanner is a benchmark crafted for evaluating language agents in tool-use and complex planning within multiple constraints. (See our paper for more details.) Introduction: In TravelPlanner, for a given query, language agents are expected to formulate a comprehensive plan that includes transportation, daily meals, attractions, and accommodation for each day. TravelPlanner comprises 1,225 queries in total. The number of days and hard constraints… See the full description on the dataset page: https://huggingface.co/datasets/osunlp/TravelPlanner.

📄 9 papers

τ-Bench · Emerging

The 'τ-Bench' dataset is used to evaluate the behavioral similarity of tool-use habits among different language model agents by modeling their actions as directed graphs.

📄 9 papers

MMLU · Emerging

Dataset Card for MMLU. Dataset Summary: Measuring Massive Multitask Language Understanding by Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt (ICLR 2021). This is a massive multitask test consisting of multiple-choice questions from various branches of knowledge. The test spans subjects in the humanities, social sciences, hard sciences, and other areas that are important for some people to learn. This covers 57… See the full description on the dataset page: https://huggingface.co/datasets/cais/mmlu.

📄 8 papers · ⬇ 473.2k · 💛 809 · 🤗 HF · mit

Moltbook · Emerging

Moltbook Dataset: a dataset of posts and communities from Moltbook, a Reddit-style social platform designed for AI agents. NOTE: This dataset is a snapshot of Moltbook before it went viral and got flooded with inauthentic accounts such as humans and bots. Files: moltbook_posts.csv (6,105 records; all posts from the platform) and moltbook_submolts.csv (124 records; all communities, or submolts). Dataset Insights: Overview… See the full description on the dataset page: https://huggingface.co/datasets/ronantakizawa/moltbook.

📄 8 papers · ⬇ 160 · 💛 55 · 🤗 HF · mit

ALFRED · Emerging

ALFRED is a benchmark dataset that contains tasks for embodied agents to evaluate their ability to understand and execute instructions in a simulated environment.

📄 8 papers

LiveCodeBench · Emerging

The 'LiveCodeBench' dataset/benchmark contains a collection of coding tasks and is used to evaluate the performance of large language models in generating correct and robust code solutions.

📄 8 papers

MiniGrid · Emerging

MiniGrid is a benchmark that contains a set of gridworld environments used to evaluate the performance of embodied agents in tasks involving navigation and interaction.

📄 8 papers

MMLU-Pro · Emerging

The MMLU-Pro dataset is a more robust and challenging massive multi-task understanding dataset tailored to more rigorously benchmark large language models' capabilities. This dataset contains 12K complex questions across various disciplines. What's New [2026.03.11]: Added more cutting-edge frontier models to the leaderboard, including the Claude-4.6 series, Seed2.0 series, Qwen3.5 series, and Gemini-3.1-Pro, among… See the full description on the dataset page: https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro.

📄 8 papers

Multi-Agent MuJoCo · Emerging

'Multi-Agent MuJoCo' is a benchmark that contains various cooperative multi-agent environments used to evaluate the performance and coordination of algorithms in addressing challenges such as non-stationarity and unstable training in multi-agent reinforcement learning.

📄 8 papers

Overcooked · Emerging

The 'Overcooked' dataset/benchmark is an environment used to evaluate zero-shot coordination performance among multi-agent systems, focusing on their ability to adapt to unseen teammates.

📄 8 papers

SWE-Bench Pro · Emerging

Dataset Summary: SWE-Bench Pro is a challenging, enterprise-level dataset for testing agent ability on long-horizon software engineering tasks. Paper: https://static.scale.com/uploads/654197dc94d34f66c0f5184e/SWEAP_Eval_Scale%20(9).pdf. Related evaluation code on GitHub: https://github.com/scaleapi/SWE-bench_Pro-os. Dataset Structure: We follow SWE-Bench Verified (https://huggingface.co/datasets/SWE-bench/SWE-bench_Verified) in terms of dataset structure, with several… See the full description on the dataset page: https://huggingface.co/datasets/ScaleAI/SWE-bench_Pro.

📄 8 papers

Terminal-Bench · Emerging

This dataset contains tasks from Terminal-Bench, a benchmark for evaluating AI agents in real terminal environments. Each task is packaged as a complete, self-contained archive that preserves the exact directory structure, binary files, Docker configurations, and test scripts needed for faithful reproduction. The archive column contains a gzipped tarball of the entire task directory. Dataset Overview: Terminal-Bench evaluates AI agents on… See the full description on the dataset page: https://huggingface.co/datasets/ia03/terminal-bench.

📄 8 papers
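
Since the Terminal-Bench entry says each task ships as a gzipped tarball in the archive column, a minimal sketch of unpacking one such value in memory may help. It assumes the column value arrives as raw `bytes`. `read_task_archive` and `make_demo_archive` are hypothetical helpers written for illustration, and the demo archive contents are made up rather than taken from the dataset.

```python
import io
import tarfile
from typing import Dict


def read_task_archive(archive_bytes: bytes) -> Dict[str, bytes]:
    """Read every regular file from an in-memory gzipped tarball.

    Returns a mapping of member path to file contents. Nothing is written
    to disk, which avoids path-traversal concerns with untrusted archives.
    """
    files: Dict[str, bytes] = {}
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tar:
        for member in tar.getmembers():
            if member.isfile():
                handle = tar.extractfile(member)
                if handle is not None:
                    files[member.name] = handle.read()
    return files


def make_demo_archive() -> bytes:
    """Build a tiny gzipped tarball in memory (stand-in for a real task)."""
    payload = b"FROM ubuntu:22.04\n"
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        info = tarfile.TarInfo(name="task/Dockerfile")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


for path, content in read_task_archive(make_demo_archive()).items():
    print(path, len(content))  # task/Dockerfile 18
```

Reading members into memory rather than extracting to a directory is a conservative choice for inspection; reproducing a task faithfully would still require extracting the full tree, including any non-regular members.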

VirtualHome · Emerging

VirtualHome is a benchmark that contains simulated environments used to evaluate Large Language Models (LLMs) on tasks related to goal interpretation, action sequencing, subgoal decomposition, and transition modeling in embodied AI.

📄 8 papers

Humanity's Last Exam · Emerging

'Humanity's Last Exam' is a benchmark used to evaluate the performance of large language models in long-horizon, deep information-seeking research tasks.

kaggle datasets

MuSiQue is a dataset used to evaluate multi-hop question answering systems by providing a collection of questions that require multiple reasoning steps to answer.

Dataset Card for MATH-500. This dataset contains a subset of 500 problems from the MATH benchmark that OpenAI created in their Let's Verify Step by Step paper. See their GitHub repo for the source file: https://github.com/openai/prm800k/tree/main?tab=readme-ov-file#math-splits

📄 6 papers · ⬇ 174.2k · 💛 319 · 🤗 HF

StrategyQA · Emerging

'StrategyQA' is a benchmark that evaluates the ability of models to reason and strategize in complex scenarios, focusing on high-heterogeneity settings.

📄 6 papers · ⬇ 30.1k · 💛 6 · 🤗 HF · mit

Search-QA · Emerging

We publicly release a new large-scale dataset, called SearchQA, for machine comprehension, or question-answering. Unlike recently released datasets, such as DeepMind CNN/DailyMail and SQuAD, the proposed SearchQA was constructed to reflect a full pipeline of general question-answering. That is, we start not from an existing article and generate a question-answer pair, but start from an existing question-answer pair, crawled from J! Archive, and augment it with text snippets retrieved by Google. Following this approach, we built SearchQA, which consists of more than 140k question-answer pairs with each pair having 49.6 snippets on average. Each question-answer-context tuple of the SearchQA comes with additional meta-data such as the snippet's URL, which we believe will be valuable resources for future research. We conduct human evaluation as well as test two baseline methods, one simple word selection and the other deep learning based, on the SearchQA. We show that there is a meaningful gap between the human and machine performances. This suggests that the proposed dataset could well serve as a benchmark for question-answering.

📄 6 papers · ⬇ 498 · 💛 23 · 🤗 HF · unknown

ACEBench · Emerging

This repository contains the ACEBench dataset, formatted for evaluating and training tool-using language models. The dataset has been processed into a unified structure, with problem descriptions merged with their corresponding ground-truth rubrics. Dataset Structure: The dataset is provided under a single configuration, en, which contains three distinct splits: normal: Standard tool-use scenarios… See the full description on the dataset page: https://huggingface.co/datasets/oliveirabruno01/acebench.

📄 6 papers · ⬇ 379 · 💛 1 · 🤗 HF · mit
