Datasets - Awesome Reinforcement Learning
297 datasets & benchmarks: 13 canonical foundations plus emerging datasets mined from recent papers. Each entry links to the papers that use it.
Dataset Card for CIFAR-100. Dataset Summary: The CIFAR-100 dataset consists of 60000 32x32 colour images in 100 classes, with 600 images per class. There are 500 training images and 100 testing images per class, for 50000 training images and 10000 test images in total. The 100 classes are grouped into 20 superclasses. There are two labels per image: a fine label (actual class) and a coarse label (superclass). Supported Tasks and Leaderboards: image-classification: The… See the full description on the dataset page: https://huggingface.co/datasets/uoft-cs/cifar100.
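The counts quoted in the CIFAR-100 summary can be cross-checked with a few lines of arithmetic. This is a minimal sketch using only the numbers in the excerpt above; the even split of fine classes across superclasses is an assumption, since the excerpt does not state it.

```python
# Numbers taken from the CIFAR-100 summary above.
NUM_FINE_CLASSES = 100
NUM_SUPERCLASSES = 20
TRAIN_PER_CLASS = 500
TEST_PER_CLASS = 100

# Per-class and overall totals implied by those numbers.
images_per_class = TRAIN_PER_CLASS + TEST_PER_CLASS
train_total = NUM_FINE_CLASSES * TRAIN_PER_CLASS
test_total = NUM_FINE_CLASSES * TEST_PER_CLASS
total_images = train_total + test_total

# Assumption: fine classes are spread evenly over the superclasses.
fine_per_superclass = NUM_FINE_CLASSES // NUM_SUPERCLASSES

print(images_per_class, train_total, test_total, total_images, fine_per_superclass)
```

The totals agree with the 600 images per class, 50000 training images, 10000 test images, and 60000 images stated in the summary.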
This dataset was created using LeRobot. Dataset Structure: meta/info.json: { "codebase_version": "v3.0", "robot_type": "panda", "total_episodes": 1693, "total_frames": 273465, "total_tasks": 40, "chunks_size": 1000, "fps": 10.0, "splits": { "train": "0:1693" }, "data_path": "data/chunk-{chunk_index:03d}/file-{file_index:03d}.parquet", "video_path": "videos/{video_key}/chunk-{chunk_index:03d}/file-{file_index:03d}.mp4"… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceVLA/libero.
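The `data_path` and `video_path` values in the excerpt above use Python format-string syntax, so they can be expanded with `str.format`. The sketch below assumes that is how they are meant to be read; the helper names and the `"front_camera"` video key are hypothetical, and only the templates and numbers come from the excerpt.

```python
# Path templates copied from the meta/info.json excerpt above.
DATA_PATH = "data/chunk-{chunk_index:03d}/file-{file_index:03d}.parquet"
VIDEO_PATH = "videos/{video_key}/chunk-{chunk_index:03d}/file-{file_index:03d}.mp4"

# Values quoted in the same excerpt.
TOTAL_EPISODES = 1693
TOTAL_FRAMES = 273465
FPS = 10.0


def data_file(chunk_index: int, file_index: int) -> str:
    """Expand the parquet path template (hypothetical helper)."""
    return DATA_PATH.format(chunk_index=chunk_index, file_index=file_index)


def video_file(video_key: str, chunk_index: int, file_index: int) -> str:
    """Expand the video path template (hypothetical helper)."""
    return VIDEO_PATH.format(
        video_key=video_key, chunk_index=chunk_index, file_index=file_index
    )


def total_duration_seconds() -> float:
    """Total recorded time implied by the frame count and frame rate."""
    return TOTAL_FRAMES / FPS


def mean_frames_per_episode() -> float:
    """Average episode length in frames."""
    return TOTAL_FRAMES / TOTAL_EPISODES


print(data_file(0, 0))
# "front_camera" is a made-up video key used only for illustration.
print(video_file("front_camera", 0, 1))
print(total_duration_seconds(), round(mean_frames_per_episode(), 1))
```

The `:03d` specifier zero-pads each index to three digits, so chunk 0, file 0 resolves to `data/chunk-000/file-000.parquet`.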
Dataset Card for GSM8K. Dataset Summary: GSM8K (Grade School Math 8K) is a dataset of 8.5K high quality linguistically diverse grade school math word problems. The dataset was created to support the task of question answering on basic mathematical problems that require multi-step reasoning. These problems take between 2 and 8 steps to solve. Solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ − ×÷) to reach the… See the full description on the dataset page: https://huggingface.co/datasets/openai/gsm8k.
Dataset Card for Mostly Basic Python Problems (mbpp). Dataset Summary: The benchmark consists of around 1,000 crowd-sourced Python programming problems, designed to be solvable by entry level programmers, covering programming fundamentals, standard library functionality, and so on. Each problem consists of a task description, code solution and 3 automated test cases. As described in the paper, a subset of the data has been hand-verified by us. Released here as part of… See the full description on the dataset page: https://huggingface.co/datasets/google-research-datasets/mbpp.
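The problem layout described above (a task description, a code solution, and 3 automated test cases) can be pictured with a small sketch. The record below is invented for illustration, and its field names (`task`, `solution`, `tests`) are hypothetical rather than the dataset's actual schema.

```python
# Hypothetical record: contents and field names are made up for illustration.
record = {
    "task": "Write a function to return the sum of a list of integers.",
    "solution": "def list_sum(xs):\n    return sum(xs)\n",
    "tests": [
        "assert list_sum([1, 2, 3]) == 6",
        "assert list_sum([]) == 0",
        "assert list_sum([-1, 1]) == 0",
    ],
}


def passes_all_tests(solution: str, tests: list[str]) -> bool:
    """Run the solution, then each assert-style test, in a shared namespace."""
    namespace: dict = {}
    try:
        exec(solution, namespace)  # defines the candidate function
        for test in tests:
            exec(test, namespace)  # raises AssertionError on failure
    except Exception:
        return False
    return True


print(passes_all_tests(record["solution"], record["tests"]))
```

Because `exec` runs arbitrary code, a real evaluation harness would normally isolate this step in a sandboxed subprocess with a timeout.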
Dataset Card for CIFAR-10. Dataset Summary: The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images. The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain… See the full description on the dataset page: https://huggingface.co/datasets/uoft-cs/cifar10.
Dataset Summary: SWE-Bench Pro is a challenging, enterprise-level dataset for testing agent ability on long-horizon software engineering tasks. Paper: https://static.scale.com/uploads/654197dc94d34f66c0f5184e/SWEAP_Eval_Scale%20(9).pdf See the related evaluation GitHub: https://github.com/scaleapi/SWE-bench_Pro-os Dataset Structure: We follow SWE-Bench Verified (https://huggingface.co/datasets/SWE-bench/SWE-bench_Verified) in terms of dataset structure, with several… See the full description on the dataset page: https://huggingface.co/datasets/ScaleAI/SWE-bench_Pro.
GAIA dataset. GAIA is a benchmark which aims at evaluating next-generation LLMs (LLMs with augmented capabilities due to added tooling, efficient prompting, access to search, etc). We added gating to prevent bots from scraping the dataset. Please do not reshare the validation or test set in a crawlable format. Data and leaderboard: GAIA is made of more than 450 non-trivial questions, each with an unambiguous answer, requiring different levels of tooling and autonomy to… See the full description on the dataset page: https://huggingface.co/datasets/gaia-benchmark/GAIA.
Dataset Card for WMDP. The Weapons of Mass Destruction Proxy (WMDP) benchmark is a dataset of multiple-choice questions that serve as a proxy measurement of hazardous knowledge in biosecurity, cybersecurity, and chemical security. WMDP serves two roles: first, as an evaluation for hazardous knowledge in LLMs, and second, as a benchmark for unlearning methods to remove such hazardous knowledge. See our paper, website, and GitHub for more details! We implemented the WMDP evaluation in… See the full description on the dataset page: https://huggingface.co/datasets/cais/wmdp.
Dataset Card for BEIR Benchmark. hotpotqa is one of the datasets from the Question Answering task within BEIR, measuring Wikipedia article retrieval for a given multi-hop query. Dataset Summary: BEIR is a heterogeneous benchmark built from 18 diverse datasets representing 9 information retrieval tasks. Fact-checking: FEVER, Climate-FEVER, SciFact; Question-Answering: NQ, HotpotQA, FiQA-2018; Bio-Medical IR: TREC-COVID, BioASQ, NFCorpus; News Retrieval: TREC-NEWS, Robust04… See the full description on the dataset page: https://huggingface.co/datasets/BeIR/hotpotqa.
https://github.com/openai/prm800k/blob/main/prm800k/math_splits/test.jsonl
Language modeling resources to be used in conjunction with the LibriSpeech ASR corpus.
IFBench: Dataset for evaluating instruction-following reward models. This repository contains the data of the paper "Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems". Paper: https://arxiv.org/abs/2502.19328 GitHub: https://github.com/THU-KEG/Agentic-Reward-Modeling Dataset Details: The samples are formatted as follows: { "id": // unique identifier of the sample, "source": // source… See the full description on the dataset page: https://huggingface.co/datasets/THU-KEG/IFBench.
D4RL Dataset on HuggingFace. This repository hosts the pre-downloaded D4RL dataset on HuggingFace. It is designed to provide accelerated data downloading for users, eliminating the need to download the dataset from scratch. Installation: To use this dataset, you need to clone it into your local .d4rl directory. Here are the steps to do so: (1) Navigate to your .d4rl directory: cd ~/.d4rl (2) Clone the dataset repository from HuggingFace: git clone… See the full description on the dataset page: https://huggingface.co/datasets/imone/D4RL.