📊 Datasets — Awesome Reinforcement Learning

2,737 datasets & benchmarks — 13 canonical foundations plus emerging datasets mined from recent papers. Each links to the papers that use it.

MuJoCo · Canonical

A physics engine whose continuous-control locomotion tasks are standard RL benchmarks.

📄 156 papers

D4RL · Canonical

Datasets for Deep Data-Driven RL — standardized offline-RL benchmark tasks and logged datasets.

📄 125 papers · ⬇ 3.4k · 💛 4 · 🤗 HF · apache-2.0

DeepMind Control Suite · Canonical

A set of continuous-control RL tasks built on MuJoCo with a standardized interface.

📄 66 papers

GSM8K · Emerging

GSM8K is a benchmark of about 8.5K grade-school math word problems that require multi-step arithmetic reasoning, widely used to evaluate the mathematical reasoning of large language models.

📄 65 papers

Atari · Canonical

The Arcade Learning Environment — Atari 2600 games used as a standard deep-reinforcement-learning benchmark.

📄 64 papers

Meta-World · Canonical

A benchmark of 50 robotic manipulation tasks for multi-task and meta-reinforcement learning.

📄 62 papers

MATH-500 · Emerging

A subset of 500 problems from the MATH benchmark that OpenAI created in their Let's Verify Step by Step paper. See their GitHub repo for the source file: https://github.com/openai/prm800k/tree/main?tab=readme-ov-file#math-splits

📄 61 papers · ⬇ 168.2k · 💛 318 · 🤗 HF

ALFWorld · Emerging

ALFWorld is an interactive text-based environment, aligned with the ALFRED embodied household tasks, that is used to evaluate large language models (LLMs) as autonomous agents on multi-turn tasks.

📄 55 papers · ⬇ 24 · 🤗 HF

StarCraft II · Canonical

StarCraft II is a real-time strategy game used as a benchmark environment for multi-agent reinforcement learning (MARL), with complex scenarios that require coordinated behavior among agents.

📄 48 papers

MATH · Emerging

MATH is a benchmark of 12,500 competition mathematics problems used to evaluate the mathematical reasoning capabilities of models.

📄 45 papers

AIME-24 · Emerging

AIME-24 is a benchmark built from the 2024 American Invitational Mathematics Examination, used to evaluate reinforcement learning with verifiable rewards (RLVR) on challenging math questions.

📄 42 papers

OpenAI Gym · Canonical

OpenAI Gym is a toolkit that provides a common interface and a variety of environments, including continuous-control tasks, for developing and evaluating reinforcement learning algorithms.

📄 38 papers

WebShop · Emerging

WebShop is a simulated online-shopping environment in which an agent navigates a web store over multiple turns to find and buy a product that matches a text instruction. It is used to evaluate reinforcement learning algorithms for training large language models as autonomous agents.

📄 38 papers

OGBench · Emerging

OgBench: Benchmarking Graph Neural Networks on Omics Data. OgBench is the first benchmark suite for graph-level prediction in the n ≪ p regime characteristic of omics data, where the number of patient samples n is much smaller than the number of nodes (genes or proteins) p per graph. The repository contains four preprocessed omics graph classification datasets; the first listed is HERITAGE (modality: Proteomics, n = 654, p = 4,977, task: Exercise responder… See the full description on the dataset page: https://huggingface.co/datasets/geometric-intelligence/ogbench.

📄 37 papers · ⬇ 285 · 🤗 HF · cc-by-4.0

CARLA · Emerging

CARLA is an open-source driving simulator used as a benchmark for reinforcement learning methods in autonomous driving. Its simulated environments allow safe exploration and learning in complex driving scenarios.

📄 37 papers · ⬇ 7 · 🤗 HF · other

AIME · Emerging

AIME is a benchmark of problems from the American Invitational Mathematics Examination, used to evaluate how well large language models generate correct solutions and intermediate reasoning steps.

📄 33 papers · ⬇ 171 · 🤗 HF

AIME 2024 · Emerging

This dataset contains problems from the American Invitational Mathematics Examination (AIME) 2024. AIME is a prestigious high school mathematics competition known for its challenging mathematical problems. Format: JSONL. Size: 30 records. Source: AIME 2024 I & II. Language: English. Each record contains the following fields: ID: Problem identifier (e.g., "2024-I-1" represents Problem 1… See the full description on the dataset page: https://huggingface.co/datasets/Maxwell-Jia/AIME_2024.

📄 32 papers · ⬇ 30.5k · 💛 85 · 🤗 HF · mit
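
The record ID format mentioned in this entry (for example "2024-I-1" for Problem 1 of AIME 2024 I) can be split into its parts. The following is only a sketch, assuming every ID has the form year-exam-number; the names `ProblemId` and `parse_problem_id` and the one-field sample record are hypothetical and not part of the dataset.

```python
import json
from typing import NamedTuple


class ProblemId(NamedTuple):
    """Parts of an ID such as "2024-I-1" (hypothetical helper type)."""
    year: int
    exam: str    # "I" or "II"
    number: int  # problem number within the exam


def parse_problem_id(raw: str) -> ProblemId:
    """Split an ID of the assumed form "<year>-<exam>-<number>"."""
    year, exam, number = raw.split("-")
    return ProblemId(int(year), exam, int(number))


# A made-up JSONL line carrying only the ID field named in the dataset card.
line = '{"ID": "2024-I-1"}'
record = json.loads(line)
pid = parse_problem_id(record["ID"])
print(pid.year, pid.exam, pid.number)
```

Keeping the exam label as a string avoids guessing how "I" and "II" should be ordered or numbered.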

MiniGrid · Emerging

MiniGrid is a benchmark of lightweight grid-world environments with discrete actions, used to evaluate exploration strategies in reinforcement learning.

📄 32 papers · ⬇ 13.3k · 💛 1 · 🤗 HF

CartPole · Emerging

CartPole is a classic control environment in which the agent balances a pole on a moving cart. It is used to evaluate the performance of RL algorithms.

📄 27 papers · ⬇ 71 · 🤗 HF · mit

AIME-25 · Emerging

AIME-25 is a benchmark built from the 2025 American Invitational Mathematics Examination, used to evaluate reinforcement learning with verifiable rewards (RLVR) on challenging math questions.

📄 26 papers

Atari games · Emerging

Atari games are a collection of Atari 2600 video games used as a benchmark to evaluate the performance of reinforcement learning algorithms.

📄 25 papers · ⬇ 48 · 🤗 HF

AIME 2025 · Emerging

This dataset contains the questions from AIME 2025 used for the MathArena Leaderboard.

Homepage: https://matharena.ai/
Repository: https://github.com/eth-sri/matharena

Data fields: problem_idx (int64) is the problem index within the corresponding MathArena benchmark; problem (string) is the problem statement, usually stored as LaTeX source; answer (int64) is the gold final answer; problem_type… See the full description on the dataset page: https://huggingface.co/datasets/MathArena/aime_2025.

📄 23 papers · ⬇ 32.1k · 💛 16 · 🤗 HF · cc-by-nc-sa-4.0
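
Because the gold answer is stored as an integer, a verifiable reward can be as simple as an exact-match check. The sketch below is illustrative only: it uses the three field names listed in this entry, but the `AimeProblem` class, the `exact_match_reward` function, and the sample problem are hypothetical, and this is not presented as how MathArena itself scores answers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AimeProblem:
    """One record, limited to the fields named in the dataset card."""
    problem_idx: int
    problem: str
    answer: int


def exact_match_reward(item: AimeProblem, model_output: str) -> float:
    """Return 1.0 if the output parses to the gold integer, else 0.0."""
    try:
        predicted = int(model_output.strip())
    except ValueError:
        # Anything that is not a plain integer counts as wrong.
        return 0.0
    return 1.0 if predicted == item.answer else 0.0


# Made-up example problem, not taken from the dataset.
sample = AimeProblem(problem_idx=1, problem="Compute $1 + 1$.", answer=2)
print(exact_match_reward(sample, " 2 "))
print(exact_match_reward(sample, "two"))
```

A real pipeline would first need to extract the final answer from a longer model response; that step is left out here on purpose.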

LiveCodeBench · Emerging

LiveCodeBench is a benchmark used to evaluate the performance of reasoning models on code generation tasks.

📄 22 papers

AMC · Emerging

AMC is a benchmark of problems from the American Mathematics Competitions, used to evaluate the performance of large language models on multi-step mathematical reasoning.

📄 21 papers · ⬇ 34 · 🤗 HF

Atari 100K · Emerging

Atari 100K is a benchmark that evaluates the sample efficiency of reinforcement learning algorithms on a selection of Atari 2600 games, with each agent limited to 100K environment steps per game.

📄 21 papers

Gridworld · Emerging

Gridworld is a benchmark used to evaluate reinforcement learning agents' ability to adapt to changing action spaces and reward functions in a controlled environment.

📄 21 papers

GPQA · Emerging

GPQA is a multiple-choice Q&A dataset of very hard questions written and validated by experts in biology, physics, and chemistry. When attempting questions outside their own domain (e.g., a physicist answering a chemistry question), these experts get only 34% accuracy, despite spending more than 30 minutes with full access to Google. The authors request that examples from this dataset not be revealed in plain text or images online, to reduce the risk of leakage into foundation… See the full description on the dataset page: https://huggingface.co/datasets/Idavidrein/gpqa.

📄 20 papers · ⬇ 99.4k · 💛 492 · 🤗 HF · cc-by-4.0

HumanEval · Emerging

HumanEval-X is a benchmark for the evaluation of the multilingual ability of code generative models. It consists of 820 high-quality human-crafted data samples (each with test cases) in Python, C++, Java, JavaScript, and Go, and can be used for various tasks.

📄 20 papers · ⬇ 4.0k · 💛 96 · 🤗 HF · apache-2.0

MPE · Emerging

Same data and splits as the original dataset, with two columns added. premise is the concatenation of premise1, premise2, premise3, and premise4. label is the encoded gold_label, with the mapping {"entailment": 0, "neutral": 1, "contradiction": 2}. The dataset card also includes the code used to create the dataset. See the full description on the dataset page: https://huggingface.co/datasets/pietrolesci/mpe.

📄 19 papers · ⬇ 30 · 🤗 HF
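
The two added columns described in this entry can be illustrated with plain dictionaries. This is a sketch, not the dataset's own creation script: the `add_columns` helper, the space used as a separator, and the example row are all assumptions made for illustration.

```python
# Mapping quoted in the dataset description.
LABEL_MAP = {"entailment": 0, "neutral": 1, "contradiction": 2}


def add_columns(row: dict) -> dict:
    """Return a copy of `row` with the derived `premise` and `label` columns."""
    parts = [row[f"premise{i}"] for i in range(1, 5)]
    out = dict(row)
    # Assumption: the four premises are joined with a single space.
    out["premise"] = " ".join(parts)
    out["label"] = LABEL_MAP[row["gold_label"]]
    return out


# Made-up row, not taken from the dataset.
example = {
    "premise1": "A dog runs.",
    "premise2": "A dog plays.",
    "premise3": "A dog jumps.",
    "premise4": "A dog barks.",
    "gold_label": "neutral",
}
result = add_columns(example)
print(result["premise"])
print(result["label"])
```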

Google Research Football · Canonical

Google Research Football is a simulated football environment used to evaluate cooperative multi-agent reinforcement learning (MARL) algorithms in complex, dynamic settings.

📄 19 papers

StarCraft Multi-Agent Challenge · Emerging

The StarCraft Multi-Agent Challenge (SMAC) is a benchmark of cooperative StarCraft II micromanagement scenarios used to evaluate multi-agent reinforcement learning models.

📄 18 papers

MMLU-Pro · Emerging

MMLU-Pro is a more robust and challenging massive multi-task understanding dataset tailored to more rigorously benchmark large language models' capabilities. This dataset contains 12K complex questions across various disciplines. What's new [2026.03.11]: added more cutting-edge frontier models to the leaderboard, including the Claude-4.6 series, Seed2.0 series, Qwen3.5 series, and Gemini-3.1-Pro… See the full description on the dataset page: https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro.

📄 17 papers · ⬇ 159.1k · 💛 504 · 🤗 HF · mit

AntMaze · Emerging

AntMaze is a set of maze navigation tasks from the D4RL benchmark, used to evaluate offline reinforcement learning algorithms.

📄 17 papers

SMAC-v-2 · Emerging

SMAC-v-2 is a benchmark dataset used to evaluate offline multi-agent reinforcement learning algorithms, containing diverse multi-task scenarios with varying numbers of agents.

📄 17 papers

Crafter · Canonical

Crafter is an open-world survival environment with high-dimensional image observations, used to evaluate whether model-based reinforcement learning agents can learn abstract representations for effective planning and control.

📄 16 papers · ⬇ 12 · 🤗 HF

Multi-Agent MuJoCo · Emerging

Multi-Agent MuJoCo is a benchmark of cooperative multi-agent environments used to evaluate how well algorithms perform and coordinate under large joint observation and action spaces.

📄 16 papers

Overcooked · Emerging

Overcooked is a cooperative cooking environment used to evaluate zero-shot coordination in multi-agent systems, focusing on the ability of agents to adapt to unseen teammates.

📄 16 papers

SUMO · Emerging

SUMO (Simulation of Urban MObility) is a traffic simulator whose scenarios are used to evaluate traffic signal control strategies on metrics such as queue percentiles and delay trends.

📄 16 papers

GPQA-Diamond · Emerging

GPQA-Diamond is the highest-quality subset of GPQA: 198 expert-written multiple-choice questions in biology, physics, and chemistry, used to evaluate the scientific reasoning of language models.

📄 15 papers · ⬇ 1.9k · 💛 10 · 🤗 HF

OlympiadBench · Emerging

OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems (ACL 2024). Note: the image content in the multimodal portion of the dataset has been adjusted, and previous issues where some images in the English physics subset were not displayed properly have been fixed. If your usage involves images, please re-download the dataset (all users are advised to download the latest version). Additionally, some entries… See the full description on the dataset page: https://huggingface.co/datasets/Hothan/OlympiadBench.

📄 15 papers · ⬇ 1.5k · 💛 5 · 🤗 HF

Procgen · Canonical

This dataset contains expert trajectories generated by a PPO reinforcement learning agent trained on each of the 16 procedurally generated gym environments from the Procgen Benchmark. The environments were created with distribution_mode=easy and unlimited levels. Disclaimer: this is not an official repository from OpenAI. Regular usage (for environment bigfish): from datasets import load_dataset train_dataset =… See the full description on the dataset page: https://huggingface.co/datasets/EpicPinkPenguin/procgen.

📄 15 papers · ⬇ 389 · 💛 6 · 🤗 HF · apache-2.0

AndroidWorld · Emerging

AndroidWorld is a benchmark used to evaluate agents that operate Android devices, including agents trained with reinforcement learning.

📄 15 papers

RewardBench · Emerging

RewardBench is a benchmark dataset used to evaluate the performance of Generative Reward Models (GRMs) in reward modeling by providing a set of preference data for training and assessing pointwise reward predictions.

📄 15 papers

SWE-bench Verified · Emerging

SWE-bench Verified is a subset of 500 samples from the SWE-bench test set that have been human-validated for quality. SWE-bench is a dataset that tests systems' ability to solve GitHub issues automatically. The dataset card links to a post with more details on the human-validation process. The dataset collects 500 test Issue-Pull Request pairs from popular Python repositories. Evaluation is performed by unit test verification using post-PR behavior as the reference solution. The original… See the full description on the dataset page: https://huggingface.co/datasets/SWE-bench/SWE-bench_Verified.

📄 14 papers · ⬇ 77.7k · 💛 112 · 🤗 HF

MMLU · Emerging

Measuring Massive Multitask Language Understanding by Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt (ICLR 2021). This is a massive multitask test consisting of multiple-choice questions from various branches of knowledge. The test spans subjects in the humanities, social sciences, hard sciences, and other areas that are important for some people to learn. This covers 57… See the full description on the dataset page: https://huggingface.co/datasets/cais/mmlu.

📄 13 papers · ⬇ 461.2k · 💛 803 · 🤗 HF · mit

MBPP · Emerging

Mostly Basic Python Problems (mbpp). The benchmark consists of around 1,000 crowd-sourced Python programming problems, designed to be solvable by entry-level programmers, covering programming fundamentals, standard library functionality, and so on. Each problem consists of a task description, a code solution, and 3 automated test cases. As described in the paper, a subset of the data has been hand-verified by the dataset authors. Released here as part of… See the full description on the dataset page: https://huggingface.co/datasets/google-research-datasets/mbpp.

📄 13 papers · ⬇ 169.0k · 💛 233 · 🤗 HF · cc-by-4.0

Arena-Hard · Emerging

Arena-Hard is a benchmark used to evaluate LLMs on alignment tasks, with challenging prompts that require reasoning and decision-making and have no verifiable ground-truth answers.

📄 13 papers · ⬇ 84 · 💛 1 · 🤗 HF

Sokoban · Emerging

Sokoban is a box-pushing puzzle benchmark used to evaluate the performance of reinforcement learning agents on multi-turn interactive tasks.

📄 13 papers · ⬇ 15 · 🤗 HF

FrozenLake · Emerging

Dataset card for FrozenLake-v1.

📄 13 papers · ⬇ 4 · 🤗 HF

AMC-23 · Emerging

AMC-23 is a benchmark used to evaluate the performance of large language models on reasoning tasks.

📄 13 papers

Lunar Lander · Emerging

Lunar Lander is a contextual reinforcement learning environment used to evaluate the efficiency and effectiveness of reinforcement learning algorithms, particularly under curriculum learning.

📄 13 papers

ScienceWorld · Emerging

ScienceWorld is a benchmark dataset used to evaluate the performance and generalization of large language model agents in complex interactive decision-making tasks.

📄 13 papers

LIBERO · Emerging

This dataset was created using LeRobot. Dataset structure (meta/info.json): { "codebase_version": "v3.0", "robot_type": "panda", "total_episodes": 1693, "total_frames": 273465, "total_tasks": 40, "chunks_size": 1000, "fps": 10.0, "splits": { "train": "0:1693" }, "data_path": "data/chunk-{chunk_index:03d}/file-{file_index:03d}.parquet", "video_path": "videos/{video_key}/chunk-{chunk_index:03d}/file-{file_index:03d}.mp4"… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceVLA/libero.

📄 12 papers · ⬇ 28.2k · 💛 68 · 🤗 HF · apache-2.0
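
The path templates and split string quoted from meta/info.json can be expanded with ordinary Python string formatting. This is a rough sketch rather than LeRobot's own loader: the helper names, the `video_key` value, and the reading of "0:1693" as a half-open start:stop episode range are assumptions.

```python
# Templates and values copied from the meta/info.json excerpt.
DATA_PATH = "data/chunk-{chunk_index:03d}/file-{file_index:03d}.parquet"
VIDEO_PATH = "videos/{video_key}/chunk-{chunk_index:03d}/file-{file_index:03d}.mp4"
TRAIN_SPLIT = "0:1693"
TOTAL_EPISODES = 1693


def data_file(chunk_index: int, file_index: int) -> str:
    """Fill in the parquet path template with zero-padded indices."""
    return DATA_PATH.format(chunk_index=chunk_index, file_index=file_index)


def video_file(video_key: str, chunk_index: int, file_index: int) -> str:
    """Fill in the video path template for one camera stream."""
    return VIDEO_PATH.format(
        video_key=video_key, chunk_index=chunk_index, file_index=file_index
    )


def parse_split(spec: str) -> range:
    """Assumption: "start:stop" is a half-open range of episode indices."""
    start, stop = (int(part) for part in spec.split(":"))
    return range(start, stop)


train_episodes = parse_split(TRAIN_SPLIT)
print(data_file(0, 3))
# "camera" is a placeholder key, not a name taken from the dataset.
print(video_file("camera", 0, 3))
print(len(train_episodes) == TOTAL_EPISODES)
```

Under the half-open reading, the train split covers exactly the 1693 episodes reported in total_episodes.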

GAIA · Emerging

GAIA is a benchmark that aims to evaluate next-generation LLMs (LLMs with augmented capabilities due to added tooling, efficient prompting, access to search, etc.). The dataset is gated to prevent bots from scraping it; please do not reshare the validation or test set in a crawlable format. GAIA is made of more than 450 non-trivial questions with an unambiguous answer, requiring different levels of tooling and autonomy to… See the full description on the dataset page: https://huggingface.co/datasets/gaia-benchmark/GAIA.

📄 12 papers · ⬇ 16.5k · 💛 736 · 🤗 HF

AlpacaEval 2 · Emerging

AlpacaEval 2 is a benchmark used to evaluate the performance of large language models on open-domain tasks.

📄 12 papers

HumanoidBench · Emerging

HumanoidBench is a challenging whole-body control benchmark used to evaluate the performance of reinforcement learning algorithms in mastering diverse humanoid tasks.

📄 12 papers

IsaacLab · Emerging

IsaacLab provides a variety of simulated robotic environments and is used to evaluate the performance of reinforcement learning algorithms on complex robotics tasks.

📄 12 papers

MT-Bench · Emerging

MT-Bench is a benchmark of multi-turn questions used to evaluate the instruction-following capabilities of large language models.

📄 12 papers

Multi-Agent Particle Environment (MPE) · Emerging

The Multi-Agent Particle Environment (MPE) is a benchmark that contains cooperative scenarios used to evaluate multi-agent reinforcement learning (MARL) algorithms and their ability to coordinate among agents.

📄 12 papers

RoboMimic · Emerging

RoboMimic is a benchmark of challenging robotic manipulation tasks used to evaluate the performance of reinforcement learning policies.

📄 12 papers
