📊 Datasets - Awesome AI Agents

317 datasets & benchmarks: 18 canonical foundations plus emerging datasets mined from recent papers. Each links to the papers that use it.

MNIST · Emerging

Dataset Card for MNIST. Dataset Summary: The MNIST dataset consists of 70,000 28x28 black-and-white images of handwritten digits extracted from two NIST databases. There are 60,000 images in the training set and 10,000 images in the test set, one class per digit for a total of 10 classes, with roughly 7,000 images (6,000 train images and 1,000 test images) per class. Half of the images were drawn by Census Bureau employees and the other half by high school students… See the full description on the dataset page: https://huggingface.co/datasets/ylecun/mnist.

📄 5 papers · ⬇ 149.3k · 💛 248 · 🤗 HF · mit
SWE-bench · Canonical
📄 3 papers
GAIA · Canonical

GAIA dataset. GAIA is a benchmark that aims to evaluate next-generation LLMs (LLMs with augmented capabilities due to added tooling, efficient prompting, access to search, etc.). We added gating to prevent bots from scraping the dataset. Please do not reshare the validation or test set in a crawlable format. Data and leaderboard: GAIA consists of more than 450 non-trivial questions, each with an unambiguous answer, requiring different levels of tooling and autonomy to… See the full description on the dataset page: https://huggingface.co/datasets/gaia-benchmark/GAIA.

📄 2 papers · ⬇ 42.2k · 💛 692 · 🤗 HF
LIBERO · Emerging

This dataset was created using LeRobot. Dataset Structure meta/info.json: { "codebase_version": "v3.0", "robot_type": "panda", "total_episodes": 1693, "total_frames": 273465, "total_tasks": 40, "chunks_size": 1000, "fps": 10.0, "splits": { "train": "0:1693" }, "data_path": "data/chunk-{chunk_index:03d}/file-{file_index:03d}.parquet", "video_path": "videos/{video_key}/chunk-{chunk_index:03d}/file-{file_index:03d}.mp4"… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceVLA/libero.

📄 2 papers · ⬇ 23.2k · 💛 60 · 🤗 HF · apache-2.0
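The path templates in the LIBERO `meta/info.json` excerpt are ordinary Python format strings, so they can be expanded with `str.format`. The sketch below is an illustration only: the chunk index, file index, and `video_key` value are hypothetical, and the excerpt does not say how episodes are assigned to chunks and files. It also derives the average episode length implied by the quoted totals.

```python
# Values copied from the meta/info.json excerpt in the LIBERO description.
INFO = {
    "total_episodes": 1693,
    "total_frames": 273465,
    "fps": 10.0,
    "data_path": "data/chunk-{chunk_index:03d}/file-{file_index:03d}.parquet",
    "video_path": "videos/{video_key}/chunk-{chunk_index:03d}/file-{file_index:03d}.mp4",
}


def expand_paths(info, chunk_index, file_index, video_key):
    """Fill both path templates for one (chunk, file) pair."""
    data = info["data_path"].format(chunk_index=chunk_index, file_index=file_index)
    video = info["video_path"].format(
        video_key=video_key, chunk_index=chunk_index, file_index=file_index
    )
    return data, video


def mean_episode_seconds(info):
    """Average episode duration implied by the totals and the frame rate."""
    frames_per_episode = info["total_frames"] / info["total_episodes"]
    return frames_per_episode / info["fps"]


# Hypothetical indices and camera key, chosen only to show the formatting.
data_path, video_path = expand_paths(INFO, chunk_index=0, file_index=3, video_key="camera")
print(data_path)   # data/chunk-000/file-003.parquet
print(video_path)  # videos/camera/chunk-000/file-003.mp4
print(round(mean_episode_seconds(INFO), 1))
```

The `:03d` specifier zero-pads each index to three digits, which is why chunk 0 appears as `chunk-000`.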
CIFAR-10 · Emerging
📄 2 papers · ⬇ 1.7k · 🤗 HF
USAMO 2026 · Emerging

Homepage: https://matharena.ai/ Repository: https://github.com/eth-sri/matharena Dataset Summary: This dataset contains the questions from USAMO 2026 used for the MathArena Leaderboard. Data Fields: The dataset contains the following fields: problem_idx (int64): Problem index within the corresponding MathArena benchmark. points (int64): Maximum score for a non-final-answer or proof-style problem. grading_scheme (string): Rubric or… See the full description on the dataset page: https://huggingface.co/datasets/MathArena/usamo_2026.

📄 2 papers · ⬇ 420 · 💛 1 · 🤗 HF · cc-by-nc-sa-4.0
SkillsBench · Emerging

Warning: The leaderboard above is generated by Hugging Face eval-results and may be incomplete until evaluation_framework: benchflow is accepted and deployed. The audited SkillsBench v1.1 result archive is https://huggingface.co/datasets/benchflow/skillsbench-leaderboard, with compact official exports under leaderboard/skillsbench/v1.1/. Warning: The dataset is a read-only mirror. The primary source for this benchmark is on GitHub: https://github.com/benchflow-ai/skillsbench. Open issues and… See the full description on the dataset page: https://huggingface.co/datasets/benchflow/skillsbench.

📄 2 papers · ⬇ 418 · 💛 5 · 🤗 HF · apache-2.0
ManiSkill · Emerging

ManiSkill Data. ManiSkill is a unified benchmark for learning generalizable robotic manipulation skills, powered by SAPIEN. It features 20 out-of-the-box task families with 2000+ diverse object models and 4M+ demonstration frames. Moreover, it enables fast visual-input learning, so that a CNN-based policy can collect samples at about 2000 FPS with 1 GPU and 16 processes on a workstation. The benchmark can be used to study a wide range of algorithms: 2D & 3D vision-based… See the full description on the dataset page: https://huggingface.co/datasets/haosulab/ManiSkill.

📄 2 papers · ⬇ 259 · 💛 3 · 🤗 HF · apache-2.0
IMO 2025 · Emerging

IMO 2025 Problems Dataset. This dataset contains the 6 problems from the 2025 International Mathematical Olympiad (IMO). The problems are formatted with proper LaTeX notation for mathematical expressions. Dataset Structure: Each example contains: id: Problem identifier (e.g., "2025-imo-p1"); problem: The problem statement with LaTeX mathematical notation; solution: The solution (currently set to null). Problem Types: The dataset includes problems covering various… See the full description on the dataset page: https://huggingface.co/datasets/lmms-lab/imo-2025.

📄 2 papers · ⬇ 107 · 💛 2 · 🤗 HF · mit
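The three-field record structure described for IMO 2025 can be modelled with a small dataclass. This is a minimal sketch, not the dataset's loader: the class name is hypothetical, the problem statements are placeholders rather than the real text, and the ids for problems 2 to 6 are extrapolated from the single example id given in the description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ImoProblem:
    id: str                  # e.g. "2025-imo-p1"
    problem: str             # statement with LaTeX notation
    solution: Optional[str]  # the description says this is currently null


def placeholder_records(n_problems=6):
    """Build stand-in records; the statements are NOT the real problems."""
    return [
        ImoProblem(
            id=f"2025-imo-p{i}",  # assumed pattern, based on the one example id
            problem=f"<statement of problem {i}>",
            solution=None,
        )
        for i in range(1, n_problems + 1)
    ]


records = placeholder_records()
unsolved = [r.id for r in records if r.solution is None]
print(len(records), unsolved[0])  # 6 2025-imo-p1
```

Typing `solution` as `Optional[str]` keeps downstream code honest about the null values: anything that grades against a reference solution has to handle `None` explicitly.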
FGVC-Aircraft · Emerging

The dataset contains 10,200 images of aircraft, with 100 images for each of 102 different aircraft model variants, most of which are airplanes. The (main) aircraft in each image is annotated with a tight bounding box and a hierarchical airplane model label. Aircraft models are organized in a four-level hierarchy. The four levels, from finer to coarser, are:
Model, e.g. Boeing 737-76J. Since certain models are nearly visually indistinguishable, this level is not used in the evaluation.
Variant, e.g. Boeing 737-700. A variant collapses all the models that are visually indistinguishable into one class. The dataset comprises 102 different variants.
Family, e.g. Boeing 737. The dataset comprises 70 different families.
Manufacturer, e.g. Boeing. The dataset comprises 41 different manufacturers.
The data is divided into three equally sized training, validation and test subsets. The first two sets can be used for development, and the last should be used for final evaluation only.

📄 2 papers · ⬇ 66 · 💛 1 · 🤗 HF
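The four-level FGVC-Aircraft label hierarchy and the quoted counts can be written down as a small data structure. The sketch below is illustrative: the class and method names are hypothetical, the example labels are the ones given in the description, and the arithmetic only checks that the quoted figures are mutually consistent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AircraftLabel:
    model: str         # finest level; not used in evaluation
    variant: str       # merges visually indistinguishable models
    family: str
    manufacturer: str  # coarsest level

    def at_level(self, level):
        """Return the label at one of the three evaluated levels."""
        if level not in ("variant", "family", "manufacturer"):
            raise ValueError(f"level {level!r} is not used in evaluation")
        return getattr(self, level)


# Example labels taken from the description.
label = AircraftLabel(
    model="Boeing 737-76J",
    variant="Boeing 737-700",
    family="Boeing 737",
    manufacturer="Boeing",
)

# Counts quoted in the description.
n_variants, images_per_variant = 102, 100
total_images = n_variants * images_per_variant
per_split = total_images // 3  # three equally sized subsets

print(label.at_level("family"), total_images, per_split)  # Boeing 737 10200 3400
```

Rejecting the `model` level in `at_level` mirrors the statement that this level is excluded from evaluation because some models cannot be told apart visually.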
BFCL · Emerging
📄 2 papers
C. elegans · Emerging
📄 2 papers
MIMIC-IV · Emerging
📄 2 papers
SWE-bench Verified · Canonical

Dataset Summary: SWE-bench Verified is a subset of 500 samples from the SWE-bench test set that have been human-validated for quality. SWE-bench is a dataset that tests systems' ability to solve GitHub issues automatically. See this post for more details on the human-validation process. The dataset collects 500 test Issue-Pull Request pairs from popular Python repositories. Evaluation is performed by unit test verification using post-PR behavior as the reference solution. The original… See the full description on the dataset page: https://huggingface.co/datasets/SWE-bench/SWE-bench_Verified.

📄 1 paper · ⬇ 69.6k · 💛 95 · 🤗 HF
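The evaluation rule in the SWE-bench Verified summary, unit tests judged against post-PR behavior, can be sketched as a simple check. This is an illustrative simplification, not the official harness: the function name, the status strings, and the test names are hypothetical, and splitting the required tests into a "fail to pass" group and a "pass to pass" group is an assumption about how the reference behavior is encoded.

```python
def is_resolved(results, fail_to_pass, pass_to_pass):
    """Return True if every required test passes after the candidate patch.

    results: mapping of test name -> "PASSED" / "FAILED" (hypothetical format).
    fail_to_pass: tests the reference PR turns from failing to passing.
    pass_to_pass: tests that passed before the PR and must keep passing.
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    # A test missing from the results counts as a failure.
    return all(results.get(name) == "PASSED" for name in required)


# Hypothetical test names and outcomes for one issue/PR pair.
results = {"test_fix": "PASSED", "test_existing": "PASSED", "test_other": "FAILED"}
print(is_resolved(results, ["test_fix"], ["test_existing"]))  # True
print(is_resolved(results, ["test_fix"], ["test_other"]))     # False
```

Treating a missing result as a failure is a conservative choice: a patch that prevents a required test from running should not be counted as a fix.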
AgentBench · Canonical
📄 1 paper · ⬇ 152 · 💛 1 · 🤗 HF
13,441,191 MRI slices · Emerging
📄 1 paper
21 breast cancer patient profiles · Emerging
📄 1 paper
26B-A4B Mixture-of-Experts · Emerging
📄 1 paper
39,300 Capella patches · Emerging
📄 1 paper
3 face verification benchmarks · Emerging
📄 1 paper
410-task benchmark · Emerging
📄 1 paper
453,683 MRI series · Emerging
📄 1 paper
736 real-world · Emerging
📄 1 paper
8,100 force-closure grasps · Emerging
📄 1 paper
81 objects · Emerging
📄 1 paper
823 clinical cases · Emerging
📄 1 paper
87 synthetic · Emerging
📄 1 paper
8-dimensional q-ary lattices · Emerging
📄 1 paper
AARRI-Bench · Emerging
📄 1 paper
ADE20K · Emerging
📄 1 paper
adversarial dataset of 103 clinical MCQs · Emerging
📄 1 paper
AgentFairBench · Emerging
📄 1 paper
AIDev · Emerging
📄 1 paper
AIME · Emerging
📄 1 paper
AIME 24/25 · Emerging
📄 1 paper
AIPR · Emerging
📄 1 paper
AI Safety Gridworlds · Emerging
📄 1 paper
ALMANAC · Emerging
📄 1 paper
Alpaca-Eval · Emerging
📄 1 paper
AMC 23 · Emerging
📄 1 paper
anatase TiO2(101)-water interface · Emerging
📄 1 paper
Arabic · Emerging
📄 1 paper
ATOM-Bench · Emerging
📄 1 paper
AuAu · Emerging
📄 1 paper
Audio MultiChallenge · Emerging
📄 1 paper
Audio-Visual Conversation Corpus · Emerging
📄 1 paper
BAGEL · Emerging
📄 1 paper
BIG-bench · Emerging
📄 1 paper
BugsInPy · Emerging
📄 1 paper
BurstGPT · Emerging
📄 1 paper
CALVIN · Emerging
📄 1 paper
cardiovascular disease (CVD) · Emerging
📄 1 paper
Cars-196 · Emerging
📄 1 paper
CFP · Emerging
📄 1 paper
CHAIR · Emerging
📄 1 paper
CHILLGuardTest · Emerging
📄 1 paper
CHILLGuardTrain · Emerging
📄 1 paper
Chinese · Emerging
📄 1 paper
ChronoID · Emerging
📄 1 paper
CIFAR-100 · Emerging
📄 1 paper