Datasets - Awesome Generative Models
358 datasets & benchmarks: 16 canonical foundations plus emerging datasets mined from recent papers. Each links to the papers that use it.
Dataset Summary SWE-bench Verified is a subset of 500 samples from the SWE-bench test set, which have been human-validated for quality. SWE-bench is a dataset that tests systems' ability to solve GitHub issues automatically. See this post for more details on the human-validation process. The dataset collects 500 test Issue-Pull Request pairs from popular Python repositories. Evaluation is performed by unit test verification using post-PR behavior as the reference solution. The original… See the full description on the dataset page: https://huggingface.co/datasets/SWE-bench/SWE-bench_Verified.
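One way to read "unit test verification" is sketched below: a patch counts as resolving an issue when the tests the reference fix is meant to repair now pass and the tests that already passed still pass. The function name `is_resolved`, the test identifiers, and the two-list split are illustrative assumptions, not the benchmark's own harness.

```python
def is_resolved(results: dict, fail_to_pass: list, pass_to_pass: list) -> bool:
    """Return True if all target tests pass and no previously passing test regresses.

    `results` maps a test identifier to True (passed) or False (failed).
    A test missing from `results` is treated as failed.
    """
    targets_fixed = all(results.get(test, False) for test in fail_to_pass)
    no_regressions = all(results.get(test, False) for test in pass_to_pass)
    return targets_fixed and no_regressions


# Hypothetical test identifiers, for illustration only.
should_now_pass = ["tests/test_parser.py::test_handles_empty_input"]
must_keep_passing = ["tests/test_parser.py::test_basic"]

after_patch = {
    "tests/test_parser.py::test_handles_empty_input": True,
    "tests/test_parser.py::test_basic": True,
}
print(is_resolved(after_patch, should_now_pass, must_keep_passing))
```

A patch that fixes the target test but breaks `test_basic` would be rejected under this rule, which is the point of checking both lists.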
Dataset Card Creation Guide Dataset Summary Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering Supported Tasks and Leaderboards Multi-modal Multiple Choice Languages English Dataset Structure Data Instances Explore more samples here. {'image': Image, 'question': 'Which of these states is farthest north?', 'choices': ['West Virginia', 'Louisiana', 'Arizona', 'Oklahoma'], 'answer': 0… See the full description on the dataset page: https://huggingface.co/datasets/derek-thomas/ScienceQA.
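As a minimal sketch, a record like the sample instance above might be turned into a lettered text prompt as follows. The helper names `format_prompt` and `answer_letter` are hypothetical, and the image field is left out for simplicity. Only the `question`, `choices`, and `answer` fields come from the sample shown.

```python
from string import ascii_uppercase


def format_prompt(record: dict) -> str:
    """Render a multiple-choice record as a lettered text prompt (hypothetical helper)."""
    lines = [f"Question: {record['question']}"]
    for letter, choice in zip(ascii_uppercase, record["choices"]):
        lines.append(f"{letter}. {choice}")
    lines.append("Answer:")
    return "\n".join(lines)


def answer_letter(record: dict) -> str:
    """Map the integer answer index to its choice letter."""
    return ascii_uppercase[record["answer"]]


# Fields copied from the sample instance; the image is omitted.
sample = {
    "question": "Which of these states is farthest north?",
    "choices": ["West Virginia", "Louisiana", "Arizona", "Oklahoma"],
    "answer": 0,
}

print(format_prompt(sample))
print(answer_letter(sample))  # A
```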
HumanEval-X is a benchmark for the evaluation of the multilingual ability of code generative models. It consists of 820 high-quality human-crafted data samples (each with test cases) in Python, C++, Java, JavaScript, and Go, and can be used for various tasks.
https://github.com/openai/prm800k/blob/main/prm800k/math_splits/test.jsonl
Dataset Card for GSM8K Dataset Summary GSM8K (Grade School Math 8K) is a dataset of 8.5K high quality linguistically diverse grade school math word problems. The dataset was created to support the task of question answering on basic mathematical problems that require multi-step reasoning. These problems take between 2 and 8 steps to solve. Solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ − × ÷) to reach the… See the full description on the dataset page: https://huggingface.co/datasets/openai/gsm8k.
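Scoring a multi-step solution usually comes down to comparing its final number with the reference. The sketch below assumes the reference solution ends with a line of the form `#### <number>`; that marker convention is an assumption not stated in the excerpt above, and the sample problem text and the helper name `extract_final_answer` are invented for illustration.

```python
import re
from typing import Optional


def extract_final_answer(solution: str) -> Optional[str]:
    """Return the number after the last '####' marker, or None if absent."""
    matches = re.findall(r"####\s*(-?[\d,]*\.?\d+)", solution)
    if not matches:
        return None
    # Drop thousands separators so "1,234" and "1234" compare equal.
    return matches[-1].replace(",", "")


# Invented two-step solution, not taken from the dataset.
sample_solution = (
    "Each box holds 4 pens, so 3 boxes hold 3 * 4 = 12 pens.\n"
    "Adding 6 loose pens gives 12 + 6 = 18.\n"
    "#### 18"
)

print(extract_final_answer(sample_solution))  # 18
```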
Dataset Card for Mostly Basic Python Problems (mbpp) Dataset Summary The benchmark consists of around 1,000 crowd-sourced Python programming problems, designed to be solvable by entry level programmers, covering programming fundamentals, standard library functionality, and so on. Each problem consists of a task description, code solution and 3 automated test cases. As described in the paper, a subset of the data has been hand-verified by us. Released here as part of… See the full description on the dataset page: https://huggingface.co/datasets/google-research-datasets/mbpp.
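The "code solution plus 3 automated test cases" structure can be sketched as a small checker that runs assert-style tests against a candidate solution. The problem below is invented for illustration and is not taken from the dataset; `run_tests` is a hypothetical helper. Because it calls `exec`, generated or untrusted code should only be run this way inside a sandbox.

```python
def run_tests(solution_code: str, test_cases: list) -> bool:
    """Return True if every assert-style test passes against the solution."""
    namespace: dict = {}
    try:
        exec(solution_code, namespace)  # defines the candidate function
        for test in test_cases:
            exec(test, namespace)  # each test is a single assert statement
    except Exception:
        # Covers failed assertions as well as errors raised by the solution.
        return False
    return True


# Illustrative problem with three test cases, mirroring the structure described.
solution = "def add_pair(a, b):\n    return a + b\n"
tests = [
    "assert add_pair(1, 2) == 3",
    "assert add_pair(-1, 1) == 0",
    "assert add_pair(0, 0) == 0",
]

print(run_tests(solution, tests))  # True
```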
GAIA dataset GAIA is a benchmark that aims to evaluate next-generation LLMs (LLMs with augmented capabilities due to added tooling, efficient prompting, access to search, etc.). We added gating to prevent bots from scraping the dataset. Please do not reshare the validation or test set in a crawlable format. Data and leaderboard GAIA is made of more than 450 non-trivial questions with an unambiguous answer, requiring different levels of tooling and autonomy to… See the full description on the dataset page: https://huggingface.co/datasets/gaia-benchmark/GAIA.
Dataset Card for Kitti The Kitti dataset. The Kitti object detection and object orientation estimation benchmark consists of 7,481 training images and 7,518 test images, comprising a total of 80,256 labeled objects.
Hub version of the T2I-CompBench dataset. All credits and licensing belong to the creators of the dataset. This version was obtained as described below. First, the ".txt" files were obtained from this directory. Code: import requests import os # Set the necessary parameters owner = "Karine-Huang" repo = "T2I-CompBench" branch = "main" directory = "examples/dataset" local_directory = "." # GitHub API URL to get contents of the directory url =… See the full description on the dataset page: https://huggingface.co/datasets/NinaKarine/t2i-compbench.