Datasets - Awesome AI Agents
317 datasets & benchmarks: 18 canonical foundations plus emerging datasets mined from recent papers. Each links to the papers that use it.
Dataset Card for MNIST Dataset Summary The MNIST dataset consists of 70,000 28x28 black-and-white images of handwritten digits extracted from two NIST databases. There are 60,000 images in the training dataset and 10,000 images in the validation dataset, one class per digit so a total of 10 classes, with 7,000 images (6,000 train images and 1,000 test images) per class. Half of the images were drawn by Census Bureau employees and the other half by high school students… See the full description on the dataset page: https://huggingface.co/datasets/ylecun/mnist.
GAIA dataset GAIA is a benchmark that aims to evaluate next-generation LLMs (LLMs with augmented capabilities due to added tooling, efficient prompting, access to search, etc.). We added gating to prevent bots from scraping the dataset. Please do not reshare the validation or test set in a crawlable format. Data and leaderboard GAIA is made of more than 450 non-trivial questions, each with an unambiguous answer, requiring different levels of tooling and autonomy to… See the full description on the dataset page: https://huggingface.co/datasets/gaia-benchmark/GAIA.
This dataset was created using LeRobot. Dataset Structure meta/info.json: { "codebase_version": "v3.0", "robot_type": "panda", "total_episodes": 1693, "total_frames": 273465, "total_tasks": 40, "chunks_size": 1000, "fps": 10.0, "splits": { "train": "0:1693" }, "data_path": "data/chunk-{chunk_index:03d}/file-{file_index:03d}.parquet", "video_path": "videos/{video_key}/chunk-{chunk_index:03d}/file-{file_index:03d}.mp4"… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceVLA/libero.
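The `data_path` and `video_path` entries in the excerpt above are Python-style format templates. The following is a minimal sketch of how they could be expanded, not LeRobot's own loader. The helper names, the `camera_front` video key, and the reading of `"0:1693"` as a half-open episode range are assumptions made for illustration.

```python
# Path templates copied verbatim from the meta/info.json excerpt.
DATA_PATH = "data/chunk-{chunk_index:03d}/file-{file_index:03d}.parquet"
VIDEO_PATH = "videos/{video_key}/chunk-{chunk_index:03d}/file-{file_index:03d}.mp4"


def data_file(chunk_index: int, file_index: int) -> str:
    """Fill the data_path template; :03d zero-pads each index to three digits."""
    return DATA_PATH.format(chunk_index=chunk_index, file_index=file_index)


def video_file(video_key: str, chunk_index: int, file_index: int) -> str:
    """Fill the video_path template for one video stream."""
    return VIDEO_PATH.format(
        video_key=video_key, chunk_index=chunk_index, file_index=file_index
    )


def parse_split(spec: str) -> range:
    """Assumption: a split spec "start:stop" is a half-open range of episode indices."""
    start, stop = spec.split(":")
    return range(int(start), int(stop))


train_episodes = parse_split("0:1693")

print(data_file(0, 0))
# "camera_front" is a hypothetical video key, not taken from the dataset.
print(video_file("camera_front", 0, 12))
print(len(train_episodes))
```

Under the half-open reading, the train split holds 1693 episodes, which agrees with the `total_episodes` value in the excerpt.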
Homepage and repository Homepage: https://matharena.ai/ Repository: https://github.com/eth-sri/matharena Dataset Summary This dataset contains the questions from USAMO 2026 used for the MathArena Leaderboard. Data Fields The dataset contains the following fields: problem_idx (int64): Problem index within the corresponding MathArena benchmark. points (int64): Maximum score for a non-final-answer or proof-style problem. grading_scheme (string): Rubric or… See the full description on the dataset page: https://huggingface.co/datasets/MathArena/usamo_2026.
Warning: The leaderboard above is generated by Hugging Face eval-results and may be incomplete until evaluation_framework: benchflow is accepted and deployed. The audited SkillsBench v1.1 result archive is https://huggingface.co/datasets/benchflow/skillsbench-leaderboard, with compact official exports under leaderboard/skillsbench/v1.1/. Warning: The dataset is a read-only mirror. The primary source for this benchmark is on GitHub: https://github.com/benchflow-ai/skillsbench. Open issues and… See the full description on the dataset page: https://huggingface.co/datasets/benchflow/skillsbench.
ManiSkill Data ManiSkill is a unified benchmark for learning generalizable robotic manipulation skills powered by SAPIEN. It features 20 out-of-box task families with 2000+ diverse object models and 4M+ demonstration frames. Moreover, it empowers fast visual input learning algorithms so that a CNN-based policy can collect samples at about 2000 FPS with 1 GPU and 16 processes on a workstation. The benchmark can be used to study a wide range of algorithms: 2D & 3D vision-based… See the full description on the dataset page: https://huggingface.co/datasets/haosulab/ManiSkill.
IMO 2025 Problems Dataset This dataset contains the 6 problems from the 2025 International Mathematical Olympiad (IMO). The problems are formatted with proper LaTeX notation for mathematical expressions. Dataset Structure Each example contains: id: Problem identifier (e.g., "2025-imo-p1") problem: The problem statement with LaTeX mathematical notation solution: The solution (currently set to null) Problem Types The dataset includes problems covering various… See the full description on the dataset page: https://huggingface.co/datasets/lmms-lab/imo-2025.
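The three fields listed above (`id`, `problem`, `solution`) can be sketched as a typed record. This is an illustrative sketch only: the `ImoProblem` type, the `parse_id` helper, and the ID pattern (generalized from the single example `"2025-imo-p1"`) are assumptions, and the problem text below is a placeholder rather than an actual IMO problem.

```python
import re
from typing import Optional, Tuple, TypedDict


class ImoProblem(TypedDict):
    """One record, mirroring the three fields named in the dataset card."""
    id: str
    problem: str
    solution: Optional[str]  # the card says this is currently null


# Assumption: every id follows the shape of the one documented example.
ID_PATTERN = re.compile(r"^(\d{4})-imo-p(\d+)$")


def parse_id(problem_id: str) -> Tuple[int, int]:
    """Split an id such as "2025-imo-p1" into (year, problem number)."""
    match = ID_PATTERN.match(problem_id)
    if match is None:
        raise ValueError(f"unexpected problem id: {problem_id!r}")
    return int(match.group(1)), int(match.group(2))


# Placeholder record; the problem text is not taken from the dataset.
example: ImoProblem = {
    "id": "2025-imo-p1",
    "problem": r"Placeholder statement with LaTeX, e.g. $x^2 + y^2 = z^2$.",
    "solution": None,
}

year, number = parse_id(example["id"])
print(year, number)
```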
The dataset contains 10,200 images of aircraft, with 100 images for each of 102 different aircraft model variants, most of which are airplanes. The (main) aircraft in each image is annotated with a tight bounding box and a hierarchical airplane model label. Aircraft models are organized in a four-level hierarchy. The four levels, from finer to coarser, are: Model, e.g. Boeing 737-76J. Since certain models are nearly visually indistinguishable, this level is not used in the evaluation. Variant, e.g. Boeing 737-700. A variant collapses all the models that are visually indistinguishable into one class. The dataset comprises 102 different variants. Family, e.g. Boeing 737. The dataset comprises 70 different families. Manufacturer, e.g. Boeing. The dataset comprises 41 different manufacturers. The data is divided into three equally sized training, validation and test subsets. The first two sets can be used for development, and the third should be used for final evaluation only.
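The four-level label hierarchy and the split arithmetic described above can be sketched as follows. The `AircraftLabel` class and `evaluation_targets` helper are hypothetical names introduced for illustration; the example label uses only the names given in the description.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class AircraftLabel:
    """One hierarchical label, listed from finest to coarsest level."""
    model: str
    variant: str
    family: str
    manufacturer: str


# The model level is excluded because some models are nearly
# visually indistinguishable, per the description.
EVALUATED_LEVELS = ("variant", "family", "manufacturer")


def evaluation_targets(label: AircraftLabel) -> Dict[str, str]:
    """Return only the levels that the description says are evaluated."""
    return {level: getattr(label, level) for level in EVALUATED_LEVELS}


label = AircraftLabel(
    model="Boeing 737-76J",
    variant="Boeing 737-700",
    family="Boeing 737",
    manufacturer="Boeing",
)

# Counts stated in the description.
NUM_VARIANTS = 102
IMAGES_PER_VARIANT = 100
total_images = NUM_VARIANTS * IMAGES_PER_VARIANT
images_per_split = total_images // 3  # three equally sized subsets

print(evaluation_targets(label))
print(total_images, images_per_split)
```

The arithmetic confirms the stated total of 10,200 images and implies 3,400 images in each of the training, validation and test subsets.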
Dataset Summary SWE-bench Verified is a subset of 500 samples from the SWE-bench test set, which have been human-validated for quality. SWE-bench is a dataset that tests systems' ability to solve GitHub issues automatically. See this post for more details on the human-validation process. The dataset collects 500 test Issue-Pull Request pairs from popular Python repositories. Evaluation is performed by unit test verification using post-PR behavior as the reference solution. The original… See the full description on the dataset page: https://huggingface.co/datasets/SWE-bench/SWE-bench_Verified.