📊 Datasets - Awesome Reinforcement Learning

297 datasets & benchmarks: 13 canonical foundations plus emerging datasets mined from recent papers. Each links to the papers that use it.

CIFAR100 [Emerging]

Dataset Card for CIFAR-100. Dataset Summary: The CIFAR-100 dataset consists of 60,000 32x32 colour images in 100 classes, with 600 images per class. Each class has 500 training images and 100 testing images, giving 50,000 training images and 10,000 test images overall. The 100 classes are grouped into 20 superclasses. Each image has two labels: a fine label (the actual class) and a coarse label (the superclass). Supported Tasks and Leaderboards: image-classification: The… See the full description on the dataset page: https://huggingface.co/datasets/uoft-cs/cifar100.

📄 4 papers · ⬇ 38.2k · 💛 63 · 🤗 HF · unknown
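
The counts quoted in the CIFAR-100 summary can be cross-checked with a few lines of arithmetic. This is a minimal sketch that uses only the figures from the card excerpt; the variable names are illustrative, and the even split of classes across superclasses is an assumption the excerpt does not state explicitly.

```python
# Figures quoted in the CIFAR-100 card excerpt.
NUM_CLASSES = 100
TRAIN_PER_CLASS = 500
TEST_PER_CLASS = 100
NUM_SUPERCLASSES = 20

# Per-class and overall totals implied by those figures.
images_per_class = TRAIN_PER_CLASS + TEST_PER_CLASS
train_total = NUM_CLASSES * TRAIN_PER_CLASS
test_total = NUM_CLASSES * TEST_PER_CLASS
total_images = train_total + test_total

# Assumption (not stated in the excerpt): classes are split evenly
# across the superclasses.
classes_per_superclass = NUM_CLASSES // NUM_SUPERCLASSES

print(images_per_class, train_total, test_total, total_images)
print(classes_per_superclass)
```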
LIBERO [Emerging]

This dataset was created using LeRobot. Dataset Structure: meta/info.json: { "codebase_version": "v3.0", "robot_type": "panda", "total_episodes": 1693, "total_frames": 273465, "total_tasks": 40, "chunks_size": 1000, "fps": 10.0, "splits": { "train": "0:1693" }, "data_path": "data/chunk-{chunk_index:03d}/file-{file_index:03d}.parquet", "video_path": "videos/{video_key}/chunk-{chunk_index:03d}/file-{file_index:03d}.mp4"… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceVLA/libero.

📄 4 papers · ⬇ 23.2k · 💛 60 · 🤗 HF · apache-2.0
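
The data_path and video_path values in the LIBERO meta/info.json excerpt look like Python format-string templates with zero-padded indices. The sketch below expands them with str.format and derives a mean episode length from the quoted totals. It assumes the placeholders are plain integers and strings; the helper name resolve_paths, the index values, and the video key "camera" are made up for illustration and are not taken from the dataset.

```python
# Templates and totals copied from the meta/info.json excerpt.
DATA_PATH = "data/chunk-{chunk_index:03d}/file-{file_index:03d}.parquet"
VIDEO_PATH = "videos/{video_key}/chunk-{chunk_index:03d}/file-{file_index:03d}.mp4"
TOTAL_EPISODES = 1693
TOTAL_FRAMES = 273465
FPS = 10.0


def resolve_paths(chunk_index, file_index, video_key):
    """Hypothetical helper: fill both templates for one chunk/file pair."""
    data = DATA_PATH.format(chunk_index=chunk_index, file_index=file_index)
    video = VIDEO_PATH.format(
        video_key=video_key, chunk_index=chunk_index, file_index=file_index
    )
    return data, video


# Illustrative indices and video key (assumptions, not dataset values).
data_file, video_file = resolve_paths(0, 3, "camera")
print(data_file)
print(video_file)

# Mean episode length implied by the quoted totals.
mean_frames = TOTAL_FRAMES / TOTAL_EPISODES
mean_seconds = mean_frames / FPS
print(round(mean_frames, 1), round(mean_seconds, 1))
```

The `:03d` format spec pads each index to three digits, which is why chunk 0 appears as `chunk-000` in the resolved path.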
AlfWorld [Emerging]
📄 4 papers · ⬇ 19 · 🤗 HF
WebShop [Emerging]
📄 4 papers
GSM8K [Emerging]

Dataset Card for GSM8K. Dataset Summary: GSM8K (Grade School Math 8K) is a dataset of 8.5K high-quality, linguistically diverse grade school math word problems. The dataset was created to support the task of question answering on basic mathematical problems that require multi-step reasoning. These problems take between 2 and 8 steps to solve. Solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ − × ÷) to reach the… See the full description on the dataset page: https://huggingface.co/datasets/openai/gsm8k.

📄 3 papers · ⬇ 895.3k · 💛 1.4k · 🤗 HF · mit
LongMemEval-S [Emerging]
📄 3 papers · ⬇ 5 · 🤗 HF
MBPP [Emerging]

Dataset Card for Mostly Basic Python Problems (mbpp). Dataset Summary: The benchmark consists of around 1,000 crowd-sourced Python programming problems, designed to be solvable by entry-level programmers, covering programming fundamentals, standard library functionality, and so on. Each problem consists of a task description, a code solution, and 3 automated test cases. As described in the paper, a subset of the data has been hand-verified by us. Released here as part of… See the full description on the dataset page: https://huggingface.co/datasets/google-research-datasets/mbpp.

📄 2 papers · ⬇ 183.8k · 💛 230 · 🤗 HF · cc-by-4.0
CIFAR10 [Emerging]

Dataset Card for CIFAR-10. Dataset Summary: The CIFAR-10 dataset consists of 60,000 32x32 colour images in 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images. The dataset is divided into five training batches and one test batch, each with 10,000 images. The test batch contains exactly 1,000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain… See the full description on the dataset page: https://huggingface.co/datasets/uoft-cs/cifar10.

📄 2 papers · ⬇ 163.0k · 💛 106 · 🤗 HF · unknown
SWE-Bench Pro [Emerging]

Dataset Summary: SWE-Bench Pro is a challenging, enterprise-level dataset for testing agent ability on long-horizon software engineering tasks. Paper: https://static.scale.com/uploads/654197dc94d34f66c0f5184e/SWEAP_Eval_Scale%20(9).pdf (see also the related evaluation GitHub: https://github.com/scaleapi/SWE-bench_Pro-os). Dataset Structure: We follow SWE-Bench Verified (https://huggingface.co/datasets/SWE-bench/SWE-bench_Verified) in terms of dataset structure, with several… See the full description on the dataset page: https://huggingface.co/datasets/ScaleAI/SWE-bench_Pro.

📄 2 papers · ⬇ 72.7k · 💛 127 · 🤗 HF
GAIA [Emerging]

GAIA dataset. GAIA is a benchmark which aims at evaluating next-generation LLMs (LLMs with augmented capabilities due to added tooling, efficient prompting, access to search, etc.). We added gating to prevent bots from scraping the dataset. Please do not reshare the validation or test set in a crawlable format. Data and leaderboard: GAIA is made of more than 450 non-trivial questions with unambiguous answers, requiring different levels of tooling and autonomy to… See the full description on the dataset page: https://huggingface.co/datasets/gaia-benchmark/GAIA.

📄 2 papers · ⬇ 42.2k · 💛 692 · 🤗 HF
WMDP [Emerging]

Dataset Card for WMDP. The Weapons of Mass Destruction Proxy (WMDP) benchmark is a dataset of multiple-choice questions that serve as a proxy measurement of hazardous knowledge in biosecurity, cybersecurity, and chemical security. WMDP serves two roles: first, as an evaluation for hazardous knowledge in LLMs, and second, as a benchmark for unlearning methods to remove such hazardous knowledge. See our paper, website, and GitHub for more details! We implemented the WMDP evaluation in… See the full description on the dataset page: https://huggingface.co/datasets/cais/wmdp.

📄 2 papers · ⬇ 25.6k · 💛 27 · 🤗 HF · mit
HotpotQA [Emerging]

Dataset Card for BEIR Benchmark. hotpotqa is one of the datasets from the Question Answering task within BEIR, measuring Wikipedia article retrieval for a given multi-hop query. Dataset Summary: BEIR is a heterogeneous benchmark built from 18 diverse datasets representing 9 information retrieval tasks. Fact-checking: FEVER, Climate-FEVER, SciFact; Question-Answering: NQ, HotpotQA, FiQA-2018; Bio-Medical IR: TREC-COVID, BioASQ, NFCorpus; News Retrieval: TREC-NEWS, Robust04… See the full description on the dataset page: https://huggingface.co/datasets/BeIR/hotpotqa.

📄 2 papers · ⬇ 1.4k · 💛 16 · 🤗 HF · cc-by-sa-4.0
LOCOMO [Emerging]
📄 2 papers · ⬇ 250 · 💛 3 · 🤗 HF
MATH500 [Emerging]

https://github.com/openai/prm800k/blob/main/prm800k/math_splits/test.jsonl

📄 2 papers · ⬇ 188 · 💛 9 · 🤗 HF
LibriSpeech [Emerging]

Language modeling resources to be used in conjunction with the LibriSpeech ASR corpus.

📄 2 papers · ⬇ 181 · 💛 1 · 🤗 HF · cc0-1.0
IFBench [Emerging]

IFBench: Dataset for evaluating instruction-following reward models. This repository contains the data of the paper "Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems" (paper: https://arxiv.org/abs/2502.19328 ; GitHub: https://github.com/THU-KEG/Agentic-Reward-Modeling ). Dataset Details: The samples are formatted as follows: { "id": // unique identifier of the sample, "source": // source… See the full description on the dataset page: https://huggingface.co/datasets/THU-KEG/IFBench.

📄 2 papers · ⬇ 174 · 💛 10 · 🤗 HF · apache-2.0
ASVspoof 5 [Emerging]
📄 2 papers
Bash Arena [Emerging]
📄 2 papers
C. elegans [Emerging]
📄 2 papers
GPT-2 [Emerging]
📄 2 papers
MT-Bench [Emerging]
📄 2 papers
PG-19 [Emerging]
📄 2 papers
SWE-bench [Emerging]
📄 2 papers
WikiText-103 [Emerging]
📄 2 papers
WorkBench [Emerging]
📄 2 papers
D4RL [Canonical]

D4RL Dataset on HuggingFace. This repository hosts the pre-downloaded D4RL dataset on HuggingFace. It is designed to speed up data downloading for users, eliminating the need to download the dataset from scratch. Installation: To use this dataset, clone it into your local .d4rl directory. Steps: (1) Navigate to your .d4rl directory: cd ~/.d4rl (2) Clone the dataset repository from HuggingFace: git clone… See the full description on the dataset page: https://huggingface.co/datasets/imone/D4RL.

📄 1 paper · ⬇ 2.9k · 💛 4 · 🤗 HF · apache-2.0
1,083 valid takes [Emerging]
📄 1 paper
10 popular LLMs [Emerging]
📄 1 paper
2026 SoccerNet VQA Challenge [Emerging]
📄 1 paper
26B-A4B Mixture-of-Experts [Emerging]
📄 1 paper
2D dynamic control [Emerging]
📄 1 paper
2WikiMultiHopQA [Emerging]
📄 1 paper
50-task hotel expense benchmark [Emerging]
📄 1 paper
7,200 image dataset [Emerging]
📄 1 paper
8,100 force-closure grasps [Emerging]
📄 1 paper
81 objects [Emerging]
📄 1 paper
89 containerized tasks [Emerging]
📄 1 paper
ABC-Bench [Emerging]
📄 1 paper
ACM Multimedia AVI Challenge 2026 [Emerging]
📄 1 paper
ActiveGraph [Emerging]
📄 1 paper
ActivityNet-QA [Emerging]
📄 1 paper
AgileX [Emerging]
📄 1 paper
AIME 2025 [Emerging]
📄 1 paper
AIME'25 [Emerging]
📄 1 paper
AIS [Emerging]
📄 1 paper
AlpacaEval 2.0 [Emerging]
📄 1 paper
AMC [Emerging]
📄 1 paper
AntPlan-270 [Emerging]
📄 1 paper
A-OKVQA [Emerging]
📄 1 paper
AppWorld [Emerging]
📄 1 paper
ARC-AGI-1 [Emerging]
📄 1 paper
Arena-Hard-v0.1 [Emerging]
📄 1 paper
ASVspoof 2021 [Emerging]
📄 1 paper
atrial fibrillation (AF) [Emerging]
📄 1 paper
atrial flutter (AFLT) [Emerging]
📄 1 paper
AutoLab [Emerging]
📄 1 paper
Bandit [Emerging]
📄 1 paper
BCI-IV-2a [Emerging]
📄 1 paper
BDD100K [Emerging]
📄 1 paper
BETA [Emerging]
📄 1 paper