📊 Datasets - Awesome Multimodal

401 datasets & benchmarks: 20 canonical foundations plus emerging datasets mined from recent papers. Each links to the papers that use it.

GSM8K (Emerging)

Dataset Card for GSM8K. Dataset Summary: GSM8K (Grade School Math 8K) is a dataset of 8.5K high-quality, linguistically diverse grade school math word problems. The dataset was created to support the task of question answering on basic mathematical problems that require multi-step reasoning. These problems take between 2 and 8 steps to solve. Solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ − × ÷) to reach the… See the full description on the dataset page: https://huggingface.co/datasets/openai/gsm8k.

📄 3 papers | ⬇ 895.3k | 💛 1.4k | 🤗 HF | mit
LIBERO (Emerging)

This dataset was created using LeRobot. Dataset structure (meta/info.json): { "codebase_version": "v3.0", "robot_type": "panda", "total_episodes": 1693, "total_frames": 273465, "total_tasks": 40, "chunks_size": 1000, "fps": 10.0, "splits": { "train": "0:1693" }, "data_path": "data/chunk-{chunk_index:03d}/file-{file_index:03d}.parquet", "video_path": "videos/{video_key}/chunk-{chunk_index:03d}/file-{file_index:03d}.mp4"… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceVLA/libero.

📄 3 papers | ⬇ 23.2k | 💛 60 | 🤗 HF | apache-2.0
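The `data_path` and `video_path` values in the LIBERO `meta/info.json` excerpt read as Python format-string templates. The following is a minimal sketch of how such templates could be resolved, assuming they are filled in with `str.format`; the helper names and the `video_key` value in the usage note are hypothetical and are not taken from LeRobot itself.

```python
# Values copied from the meta/info.json excerpt in the LIBERO entry.
info = {
    "total_episodes": 1693,
    "total_frames": 273465,
    "fps": 10.0,
    "data_path": "data/chunk-{chunk_index:03d}/file-{file_index:03d}.parquet",
    "video_path": "videos/{video_key}/chunk-{chunk_index:03d}/file-{file_index:03d}.mp4",
}


def resolve_data_path(chunk_index: int, file_index: int) -> str:
    """Fill the parquet path template (hypothetical helper)."""
    return info["data_path"].format(chunk_index=chunk_index, file_index=file_index)


def resolve_video_path(video_key: str, chunk_index: int, file_index: int) -> str:
    """Fill the video path template (hypothetical helper)."""
    return info["video_path"].format(
        video_key=video_key, chunk_index=chunk_index, file_index=file_index
    )


# Rough size figures derived only from the numbers in the excerpt.
total_seconds = info["total_frames"] / info["fps"]
mean_frames_per_episode = info["total_frames"] / info["total_episodes"]

if __name__ == "__main__":
    print(resolve_data_path(0, 3))
    print(f"{total_seconds / 3600:.1f} hours of data at {info['fps']} fps")
    print(f"{mean_frames_per_episode:.1f} frames per episode on average")
```

For example, `resolve_video_path("cam", 1, 12)` fills the template with a made-up camera key; the real keys are listed in the part of `info.json` that the excerpt truncates.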
HumanEval (Emerging)

HumanEval-X is a benchmark for the evaluation of the multilingual ability of code generative models. It consists of 820 high-quality human-crafted data samples (each with test cases) in Python, C++, Java, JavaScript, and Go, and can be used for various tasks.

📄 3 papers | ⬇ 1.8k | 💛 95 | 🤗 HF | apache-2.0
CIFAR-10 (Emerging)
📄 3 papers | ⬇ 1.7k | 🤗 HF
ADE20K (Emerging)
📄 3 papers
CIFAR-100 (Emerging)
📄 3 papers
MBPP (Emerging)

Dataset Card for Mostly Basic Python Problems (mbpp). Dataset Summary: The benchmark consists of around 1,000 crowd-sourced Python programming problems, designed to be solvable by entry-level programmers, covering programming fundamentals, standard library functionality, and so on. Each problem consists of a task description, a code solution, and 3 automated test cases. As described in the paper, a subset of the data has been hand-verified by us. Released here as part of… See the full description on the dataset page: https://huggingface.co/datasets/google-research-datasets/mbpp.

📄 2 papers | ⬇ 183.8k | 💛 230 | 🤗 HF | cc-by-4.0
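The MBPP summary describes each problem as a task description, a code solution, and 3 automated test cases. A minimal sketch of checking a candidate solution against such test cases is shown here, assuming the tests are plain `assert` statements stored as strings. The sample problem and the field names `text`, `code`, and `test_list` are illustrative assumptions, not records copied from the dataset.

```python
# Hypothetical record shaped like the description: task text, solution, 3 tests.
problem = {
    "text": "Write a function to add two numbers.",
    "code": "def add(a, b):\n    return a + b",
    "test_list": [
        "assert add(1, 2) == 3",
        "assert add(-1, 1) == 0",
        "assert add(0, 0) == 0",
    ],
}


def passes_all_tests(code: str, tests: list[str]) -> bool:
    """Run the candidate code, then each assert-style test, in one namespace.

    Returns False if the code fails to execute or any assertion fails.
    exec() runs arbitrary code, so real evaluations should use a sandbox.
    """
    namespace: dict = {}
    try:
        exec(code, namespace)
        for test in tests:
            exec(test, namespace)
    except Exception:
        return False
    return True


if __name__ == "__main__":
    print(passes_all_tests(problem["code"], problem["test_list"]))
```

A buggy candidate such as `def add(a, b): return a - b` fails the first assertion, so the function returns `False` for it.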
MNIST (Emerging)

Dataset Card for MNIST. Dataset Summary: The MNIST dataset consists of 70,000 28x28 black-and-white images of handwritten digits extracted from two NIST databases. There are 60,000 images in the training dataset and 10,000 images in the validation dataset, one class per digit so a total of 10 classes, with 7,000 images (6,000 train images and 1,000 test images) per class. Half of the images were drawn by Census Bureau employees and the other half by high school students… See the full description on the dataset page: https://huggingface.co/datasets/ylecun/mnist.

📄 2 papers | ⬇ 149.3k | 💛 248 | 🤗 HF | mit
BrowseComp-Plus (Emerging)

BrowseComp-Plus is a new benchmark for Deep-Research systems, isolating the effect of the retriever and the LLM agent to enable fair, transparent comparisons of Deep-Research agents. The benchmark sources challenging, reasoning-intensive queries from OpenAI's BrowseComp. However, instead of searching the live web, BrowseComp-Plus evaluates against a fixed, curated corpus of ~100K web documents. The corpus includes both human-verified evidence documents… See the full description on the dataset page: https://huggingface.co/datasets/Tevatron/browsecomp-plus.

📄 2 papers | ⬇ 30.9k | 💛 35 | 🤗 HF | mit
ScienceQA (Canonical)

Dataset Summary: Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering. Supported Tasks and Leaderboards: multi-modal multiple choice. Languages: English. Dataset Structure, Data Instances: {'image': Image, 'question': 'Which of these states is farthest north?', 'choices': ['West Virginia', 'Louisiana', 'Arizona', 'Oklahoma'], 'answer': 0… See the full description on the dataset page: https://huggingface.co/datasets/derek-thomas/ScienceQA.

📄 2 papers | ⬇ 21.3k | 💛 234 | 🤗 HF | cc-by-sa-4.0
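The ScienceQA data instance stores the answer as an integer next to a list of choices. The sketch below assumes that `answer` is a zero-based index into `choices`, which is consistent with the excerpt (index 0 is West Virginia, the northernmost of the four states). The helper names are hypothetical, and the image field is replaced by a placeholder.

```python
from typing import Any

# Fields copied from the truncated data instance; the image is a placeholder.
sample: dict[str, Any] = {
    "image": None,  # the real record holds an image object here
    "question": "Which of these states is farthest north?",
    "choices": ["West Virginia", "Louisiana", "Arizona", "Oklahoma"],
    "answer": 0,
}


def answer_text(record: dict[str, Any]) -> str:
    """Map the integer answer to its choice string (assumes zero-based index)."""
    index = record["answer"]
    choices = record["choices"]
    if not 0 <= index < len(choices):
        raise ValueError(f"answer index {index} is out of range")
    return choices[index]


def format_question(record: dict[str, Any]) -> str:
    """Render the question with lettered options, one per line."""
    lines = [record["question"]]
    for i, choice in enumerate(record["choices"]):
        lines.append(f"{chr(ord('A') + i)}. {choice}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_question(sample))
    print("Gold answer:", answer_text(sample))
```

Keeping the index lookup in one helper makes it easy to adjust if the full dataset card turns out to define the `answer` field differently.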
HotpotQA (Emerging)

Dataset Card for BEIR Benchmark. hotpotqa is one of the datasets from the Question Answering task within BEIR, measuring Wikipedia article retrieval for a given multi-hop query. Dataset Summary: BEIR is a heterogeneous benchmark built from 18 diverse datasets representing 9 information retrieval tasks. Fact-checking: FEVER, Climate-FEVER, SciFact. Question-Answering: NQ, HotpotQA, FiQA-2018. Bio-Medical IR: TREC-COVID, BioASQ, NFCorpus. News Retrieval: TREC-NEWS, Robust04… See the full description on the dataset page: https://huggingface.co/datasets/BeIR/hotpotqa.

📄 2 papers | ⬇ 1.4k | 💛 16 | 🤗 HF | cc-by-sa-4.0
KITTI (Emerging)

Dataset Card for Kitti. The Kitti object detection and object orientation estimation benchmark consists of 7481 training images and 7518 test images, comprising a total of 80,256 labeled objects.

📄 2 papers | ⬇ 1.3k | 💛 5 | 🤗 HF | unknown
nuScenes (Emerging)
📄 2 papers | ⬇ 116 | 🤗 HF | apache-2.0
ALFWorld (Emerging)
📄 2 papers | ⬇ 19 | 🤗 HF
Cantonese JCCOCC MoCA (Emerging)
📄 2 papers
COCO (Emerging)
📄 2 papers
English DementiaBank Pitt (Emerging)
📄 2 papers
Llama-3.1-8B (Emerging)
📄 2 papers
MATH (Emerging)
📄 2 papers
MIMIC-IV (Emerging)
📄 2 papers
ScanNet++ (Emerging)
📄 2 papers
TinyImageNet (Emerging)
📄 2 papers
MMMU (Canonical)

MMMU (A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI). 🔔 News: 🛠️ [2026-04-21]: Fixed option issue in test_Psychology_15. ‼️ [2026-02-12]: We have released the answers for the test set! You can now evaluate your models on the test set locally! 🎉 🛠️ [2024-05-30]: Fixed duplicate option issues in Materials dataset items (validation_Materials_25;… See the full description on the dataset page: https://huggingface.co/datasets/MMMU/MMMU.

📄 1 paper | ⬇ 70.3k | 💛 328 | 🤗 HF | apache-2.0
POPE (Canonical)

Large-scale Multi-modality Models Evaluation Suite: accelerating the development of large-scale multi-modality models (LMMs) with lmms-eval. This Dataset: This is a formatted version of POPE. It is used in our lmms-eval pipeline to allow for one-click evaluations of large multi-modality models. @article{li2023evaluating, title={Evaluating object hallucination in large vision-language models}, author={Li… See the full description on the dataset page: https://huggingface.co/datasets/lmms-lab/POPE.

📄 1 paper | ⬇ 33.1k | 💛 20 | 🤗 HF
Flickr30k (Canonical)

Original paper: From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Homepage: https://shannon.cs.illinois.edu/DenotationGraph/ BibTeX: @article{young2014image, title={From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions}, author={Young, Peter and Lai, Alice and Hodosh, Micah and Hockenmaier, Julia}, journal={Transactions of the Association… See the full description on the dataset page: https://huggingface.co/datasets/nlphuji/flickr30k.

📄 1 paper | ⬇ 12.3k | 💛 105 | 🤗 HF
SEED-Bench (Canonical)

SEED-Bench Card. Benchmark details. Benchmark type: SEED-Bench is a large-scale benchmark to evaluate Multimodal Large Language Models (MLLMs). It consists of 19K multiple-choice questions with accurate human annotations, which cover 12 evaluation dimensions including the comprehension of both the image and video modality. Benchmark date: SEED-Bench was collected in July 2023. Paper or resources for more information: https://github.com/AILab-CVC/SEED-Bench License:… See the full description on the dataset page: https://huggingface.co/datasets/AILab-CVC/SEED-Bench.

📄 1 paper | ⬇ 2.1k | 💛 23 | 🤗 HF | cc-by-nc-4.0
11 widely-used medical datasets (Emerging)
📄 1 paper
150,000 articulated 3D objects (Emerging)
📄 1 paper
3D biparametric prostate MRI (bpMRI) (Emerging)
📄 1 paper
8,100 force-closure grasps (Emerging)
📄 1 paper
81 objects (Emerging)
📄 1 paper
88 eGeMAPS (Emerging)
📄 1 paper
ActorsHQ (Emerging)
📄 1 paper
ADNI-1 (Emerging)
📄 1 paper
AdsQA (Emerging)
📄 1 paper
AdvBench (Emerging)
📄 1 paper
AeroEngQA (Emerging)
📄 1 paper
AgNews (Emerging)
📄 1 paper
AITDNA (Emerging)
📄 1 paper
Alpaca-Eval (Emerging)
📄 1 paper
AmchiBias (Emerging)
📄 1 paper
AnyContraSyn (Emerging)
📄 1 paper
apparent diffusion coefficient map (ADC) (Emerging)
📄 1 paper
AppWorld (Emerging)
📄 1 paper
ArMeme (Emerging)
📄 1 paper
ASVspoof 2019 LA (Emerging)
📄 1 paper
ASVspoof 2021 DF (Emerging)
📄 1 paper
AudioDER (Emerging)
📄 1 paper
Audio MultiChallenge (Emerging)
📄 1 paper
Audio-Visual Conversation Corpus (Emerging)
📄 1 paper
AuthorBench (Emerging)
📄 1 paper
AxT2 (Emerging)
📄 1 paper
Bangla_MMLU (Emerging)
📄 1 paper
BFCL (Emerging)
📄 1 paper
BFCL Multi-Turn (Emerging)
📄 1 paper
BraTS21 (Emerging)
📄 1 paper
BraTS-PEDs (Emerging)
📄 1 paper
Bridge2AI-Voice (Emerging)
📄 1 paper
Busy Beaver problems (Emerging)
📄 1 paper
CalorieBench-80K (Emerging)
📄 1 paper