📊 Datasets — Awesome Multimodal

401 datasets & benchmarks — 20 canonical foundations plus emerging datasets mined from recent papers. Each links to the papers that use it.

GSM8K (Emerging)

GSM8K (Grade School Math 8K) is a dataset of 8.5K high-quality, linguistically diverse grade school math word problems. The dataset was created to support the task of question answering on basic mathematical problems that require multi-step reasoning. These problems take between 2 and 8 steps to solve. Solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ − × ÷) to reach the… See the full description on the dataset page: https://huggingface.co/datasets/openai/gsm8k.

📄 3 papers | ⬇ 895.3k | 💛 1.4k | 🤗 HF | mit
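GSM8K reference solutions end with a final line of the form `#### <answer>`, which is what exact-match scoring usually compares against. The sketch below pulls that value out of a solution string. The function name `extract_final_answer` and the example problem are made up for illustration; they are not taken from the dataset.

```python
import re
from typing import Optional


def extract_final_answer(solution: str) -> Optional[str]:
    """Return the number after the trailing '####' marker, without thousands separators."""
    match = re.search(r"####\s*(-?[\d,]*\.?\d+)\s*$", solution.strip())
    if match is None:
        return None
    return match.group(1).replace(",", "")


# Made-up solution in the multi-step style described above (not a real record).
example = (
    "Ana buys 3 packs of 4 pencils, so she has 3 * 4 = 12 pencils.\n"
    "She gives away 5, leaving 12 - 5 = 7 pencils.\n"
    "#### 7"
)

print(extract_final_answer(example))  # 7
```

Comparing the extracted strings, rather than the full solution text, keeps scoring independent of how the intermediate steps are worded.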

LIBERO (Emerging)

This dataset was created using LeRobot. Dataset Structure meta/info.json: { "codebase_version": "v3.0", "robot_type": "panda", "total_episodes": 1693, "total_frames": 273465, "total_tasks": 40, "chunks_size": 1000, "fps": 10.0, "splits": { "train": "0:1693" }, "data_path": "data/chunk-{chunk_index:03d}/file-{file_index:03d}.parquet", "video_path": "videos/{video_key}/chunk-{chunk_index:03d}/file-{file_index:03d}.mp4"… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceVLA/libero.

📄 3 papers | ⬇ 23.2k | 💛 60 | 🤗 HF | apache-2.0
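The `meta/info.json` excerpt for LIBERO contains enough to work out two things: how the `data_path` template expands into a file name, and how much recorded time the dataset holds. A minimal sketch, using only values copied from the excerpt; the helper name `resolve_data_path` and the chunk and file indices passed to it are illustrative assumptions, not a claim about which files exist.

```python
# Values copied from the meta/info.json excerpt in the LIBERO entry.
info = {
    "total_episodes": 1693,
    "total_frames": 273465,
    "fps": 10.0,
    "data_path": "data/chunk-{chunk_index:03d}/file-{file_index:03d}.parquet",
}


def resolve_data_path(chunk_index: int, file_index: int) -> str:
    """Expand the path template; ':03d' zero-pads each index to three digits."""
    return info["data_path"].format(chunk_index=chunk_index, file_index=file_index)


# Total recorded time is frames divided by frames per second.
total_seconds = info["total_frames"] / info["fps"]
mean_episode_seconds = total_seconds / info["total_episodes"]

print(resolve_data_path(0, 3))  # data/chunk-000/file-003.parquet
print(f"{total_seconds / 3600:.1f} hours in total")
print(f"{mean_episode_seconds:.1f} seconds per episode on average")
```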

HumanEval (Emerging)

HumanEval-X is a benchmark for the evaluation of the multilingual ability of code generative models. It consists of 820 high-quality human-crafted data samples (each with test cases) in Python, C++, Java, JavaScript, and Go, and can be used for various tasks.

📄 3 papers | ⬇ 1.8k | 💛 95 | 🤗 HF | apache-2.0

Mostly Basic Python Problems (MBPP). The benchmark consists of around 1,000 crowd-sourced Python programming problems, designed to be solvable by entry-level programmers, covering programming fundamentals, standard library functionality, and so on. Each problem consists of a task description, a code solution, and 3 automated test cases. As described in the paper, a subset of the data has been hand-verified by the dataset authors. Released here as part of… See the full description on the dataset page: https://huggingface.co/datasets/google-research-datasets/mbpp.

📄 2 papers | ⬇ 183.8k | 💛 230 | 🤗 HF | cc-by-4.0
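Because each MBPP problem pairs a solution with 3 automated test cases, scoring a candidate comes down to running the asserts against it. The sketch below shows that loop on a hypothetical record. The field names `text`, `code`, and `test_list` are assumptions about the record layout, and the problem itself is invented for illustration.

```python
from typing import Dict, List

# Hypothetical record shaped like the description: task text, solution, 3 tests.
problem = {
    "text": "Write a function to return the square of a number.",
    "code": "def square(n):\n    return n * n",
    "test_list": [
        "assert square(2) == 4",
        "assert square(-3) == 9",
        "assert square(0) == 0",
    ],
}


def passes_all_tests(code: str, tests: List[str]) -> bool:
    """Define the candidate in a fresh namespace, then run each assert against it."""
    namespace: Dict[str, object] = {}
    try:
        # exec is acceptable for trusted code only; model output needs a sandbox.
        exec(code, namespace)
        for test in tests:
            exec(test, namespace)
    except Exception:
        return False
    return True


print(passes_all_tests(problem["code"], problem["test_list"]))  # True
```

A candidate counts as solved only if every assert passes, so a single failing test or a runtime error marks the whole problem as failed.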

MNIST (Emerging)

The MNIST dataset consists of 70,000 28x28 black-and-white images of handwritten digits extracted from two NIST databases. There are 60,000 images in the training dataset and 10,000 images in the validation dataset, one class per digit so a total of 10 classes, with 7,000 images (6,000 train images and 1,000 test images) per class. Half of the images were drawn by Census Bureau employees and the other half by high school students… See the full description on the dataset page: https://huggingface.co/datasets/ylecun/mnist.

📄 2 papers | ⬇ 149.3k | 💛 248 | 🤗 HF | mit

BrowseComp-Plus (Emerging)

BrowseComp-Plus is a new benchmark for Deep-Research systems that isolates the effect of the retriever and the LLM agent to enable fair, transparent comparisons of Deep-Research agents. The benchmark sources challenging, reasoning-intensive queries from OpenAI's BrowseComp. However, instead of searching the live web, BrowseComp-Plus evaluates against a fixed, curated corpus of ~100K web documents. The corpus includes both human-verified evidence documents… See the full description on the dataset page: https://huggingface.co/datasets/Tevatron/browsecomp-plus.

📄 2 papers | ⬇ 30.9k | 💛 35 | 🤗 HF | mit

ScienceQA (Canonical)

Dataset summary: Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering. Supported tasks and leaderboards: multi-modal multiple choice. Languages: English. Data instance: {'image': Image, 'question': 'Which of these states is farthest north?', 'choices': ['West Virginia', 'Louisiana', 'Arizona', 'Oklahoma'], 'answer': 0… See the full description on the dataset page: https://huggingface.co/datasets/derek-thomas/ScienceQA.

📄 2 papers | ⬇ 21.3k | 💛 234 | 🤗 HF | cc-by-sa-4.0
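The ScienceQA data instance stores the label as an integer, which appears to be a zero-based index into `choices` (index 0 is 'West Virginia', the correct answer to the sample question). A minimal sketch of turning such a record into a lettered prompt and checking a predicted letter follows. The helper names `format_prompt` and `is_correct` are hypothetical, and the image field is left out because the excerpt does not show its contents.

```python
# Fields copied from the truncated data instance; the image field is omitted.
instance = {
    "question": "Which of these states is farthest north?",
    "choices": ["West Virginia", "Louisiana", "Arizona", "Oklahoma"],
    "answer": 0,  # assumed to be a zero-based index into "choices"
}

LETTERS = "ABCDEFGH"


def format_prompt(item: dict) -> str:
    """Render the question followed by one lettered line per choice."""
    lines = [item["question"]]
    lines += [f"{LETTERS[i]}. {choice}" for i, choice in enumerate(item["choices"])]
    return "\n".join(lines)


def is_correct(item: dict, predicted_letter: str) -> bool:
    """Map a predicted letter back to an index and compare it with the label."""
    letter = predicted_letter.strip().upper()
    if len(letter) != 1 or letter not in LETTERS:
        return False
    return LETTERS.index(letter) == item["answer"]


print(format_prompt(instance))
print(instance["choices"][instance["answer"]])  # West Virginia
print(is_correct(instance, "a"))  # True
```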

HotpotQA (Emerging)

From the BEIR benchmark dataset card: hotpotqa is one of the datasets from the Question Answering task within BEIR, measuring Wikipedia article retrieval for a given multi-hop query. BEIR is a heterogeneous benchmark built from 18 diverse datasets representing 9 information retrieval tasks. Fact-checking: FEVER, Climate-FEVER, SciFact. Question-Answering: NQ, HotpotQA, FiQA-2018. Bio-Medical IR: TREC-COVID, BioASQ, NFCorpus. News Retrieval: TREC-NEWS, Robust04… See the full description on the dataset page: https://huggingface.co/datasets/BeIR/hotpotqa.

📄 2 papers | ⬇ 1.4k | 💛 16 | 🤗 HF | cc-by-sa-4.0

KITTI (Emerging)

The KITTI object detection and object orientation estimation benchmark consists of 7481 training images and 7518 test images, comprising a total of 80,256 labeled objects.

📄 2 papers | ⬇ 1.3k | 💛 5 | 🤗 HF | unknown

nuScenes (Emerging)

📄 2 papers | ⬇ 116 | 🤗 HF | apache-2.0

ALFWorld (Emerging)

📄 2 papers | ⬇ 19 | 🤗 HF

Cantonese JCCOCC MoCA (Emerging)

📄 2 papers

COCO (Emerging)

📄 2 papers

English DementiaBank Pitt (Emerging)

MMMU (A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI) 🌐 Homepage | 🏆 Leaderboard | 🤗 Dataset | 🤗 Paper | 📖 arXiv | GitHub 🔔News 🛠️[2026-04-21]: Fixed option issue in test_Psychology_15. ‼️[2026-02-12]: We have released the answers for the test set! You can now evaluate your models on the test set locally! 🎉 🛠️[2024-05-30]: Fixed duplicate option issues in Materials dataset items (validation_Materials_25;… See the full description on the dataset page: https://huggingface.co/datasets/MMMU/MMMU.

📄 1 paper | ⬇ 70.3k | 💛 328 | 🤗 HF | apache-2.0

POPE (Canonical)

This is a formatted version of POPE. It is used in the lmms-eval pipeline (Large-scale Multi-modality Models Evaluation Suite) to allow for one-click evaluations of large multi-modality models. Citation: @article{li2023evaluating, title={Evaluating object hallucination in large vision-language models}, author={Li… See the full description on the dataset page: https://huggingface.co/datasets/lmms-lab/POPE.

📄 1 paper | ⬇ 33.1k | 💛 20 | 🤗 HF

Flickr30k (Canonical)

Flickr30k Original paper: From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions Homepage: https://shannon.cs.illinois.edu/DenotationGraph/ Bibtex: @article{young2014image, title={From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions}, author={Young, Peter and Lai, Alice and Hodosh, Micah and Hockenmaier, Julia}, journal={Transactions of the Association… See the full description on the dataset page: https://huggingface.co/datasets/nlphuji/flickr30k.

📄 1 paper | ⬇ 12.3k | 💛 105 | 🤗 HF

SEED-Bench (Canonical)

Benchmark type: SEED-Bench is a large-scale benchmark to evaluate Multimodal Large Language Models (MLLMs). It consists of 19K multiple-choice questions with accurate human annotations, covering 12 evaluation dimensions including the comprehension of both the image and video modality. Benchmark date: SEED-Bench was collected in July 2023. Paper or resources for more information: https://github.com/AILab-CVC/SEED-Bench License:… See the full description on the dataset page: https://huggingface.co/datasets/AILab-CVC/SEED-Bench.