📊 Datasets - Awesome Similarity Search

336 datasets & benchmarks: 14 canonical foundations plus emerging datasets mined from recent papers. Each links to the papers that use it.

GSM8K [Emerging]

GSM8K (Grade School Math 8K) is a dataset of 8.5K high-quality, linguistically diverse grade school math word problems. The dataset was created to support the task of question answering on basic mathematical problems that require multi-step reasoning. These problems take between 2 and 8 steps to solve. Solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ − × ÷) to reach the… See the full description on the dataset page: https://huggingface.co/datasets/openai/gsm8k.

📄 3 papers ⬇ 895.3k 💛 1.4k 🤗 HF mit
LIBERO [Emerging]

This dataset was created using LeRobot. Dataset Structure meta/info.json: { "codebase_version": "v3.0", "robot_type": "panda", "total_episodes": 1693, "total_frames": 273465, "total_tasks": 40, "chunks_size": 1000, "fps": 10.0, "splits": { "train": "0:1693" }, "data_path": "data/chunk-{chunk_index:03d}/file-{file_index:03d}.parquet", "video_path": "videos/{video_key}/chunk-{chunk_index:03d}/file-{file_index:03d}.mp4"… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceVLA/libero.

📄 3 papers ⬇ 23.2k 💛 60 🤗 HF apache-2.0
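
The `data_path` and `video_path` entries in the LIBERO `meta/info.json` excerpt above use Python-style format fields such as `{chunk_index:03d}`. As a minimal sketch, assuming those templates are expanded with ordinary `str.format` (the card excerpt does not say how LeRobot resolves them), the snippet below shows what the zero-padded paths look like and derives the mean episode length from the listed counts. The helper names and the `video_key` value are hypothetical and used only for illustration.

```python
# Path templates copied verbatim from the LIBERO meta/info.json excerpt.
DATA_PATH = "data/chunk-{chunk_index:03d}/file-{file_index:03d}.parquet"
VIDEO_PATH = "videos/{video_key}/chunk-{chunk_index:03d}/file-{file_index:03d}.mp4"

# Counts copied from the same excerpt.
TOTAL_EPISODES = 1693
TOTAL_FRAMES = 273465
FPS = 10.0


def resolve_data_path(chunk_index: int, file_index: int) -> str:
    """Expand the parquet template; `03d` zero-pads each index to three digits."""
    return DATA_PATH.format(chunk_index=chunk_index, file_index=file_index)


def resolve_video_path(video_key: str, chunk_index: int, file_index: int) -> str:
    """Expand the video template for one (hypothetical) camera key."""
    return VIDEO_PATH.format(
        video_key=video_key, chunk_index=chunk_index, file_index=file_index
    )


def mean_episode_seconds() -> float:
    """Average episode duration implied by total frames, episodes, and fps."""
    return TOTAL_FRAMES / TOTAL_EPISODES / FPS


if __name__ == "__main__":
    print(resolve_data_path(0, 3))
    # "cam" is a placeholder; the real video keys are not given in the excerpt.
    print(resolve_video_path("cam", 1, 12))
    print(f"{mean_episode_seconds():.1f} s per episode on average")
```

The derived figure is only an average: the excerpt gives totals, not per-episode lengths.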
CIFAR-10 [Canonical]
📄 2 papers ⬇ 1.7k 🤗 HF
BEIR [Canonical]

BEIR is a heterogeneous benchmark built from 18 diverse datasets representing 9 information retrieval tasks: fact-checking (FEVER, Climate-FEVER, SciFact); question answering (NQ, HotpotQA, FiQA-2018); biomedical IR (TREC-COVID, BioASQ, NFCorpus); news retrieval (TREC-NEWS, Robust04); argument retrieval (Touche-2020, ArguAna); duplicate question retrieval (Quora, CqaDupstack); citation prediction (SCIDOCS); Tweet… See the full description on the dataset page: https://huggingface.co/datasets/BeIR/beir.

📄 2 papers ⬇ 270 💛 10 🤗 HF cc-by-sa-4.0
LoCoMo [Emerging]
📄 2 papers ⬇ 250 💛 3 🤗 HF
ImageNet [Canonical]
📄 2 papers ⬇ 32 🤗 HF
MS-COCO [Emerging]
📄 2 papers ⬇ 31 🤗 HF
WHU-CD [Emerging]
📄 2 papers ⬇ 6 🤗 HF
LongMemEval_S [Emerging]
📄 2 papers ⬇ 5 🤗 HF
VBench [Emerging]
📄 2 papers ⬇ 4 🤗 HF
CIFAR-100 [Emerging]
📄 2 papers
LEVIR-CD [Emerging]
📄 2 papers
T2ICompBench [Emerging]
📄 2 papers
MNIST [Canonical]

The MNIST dataset consists of 70,000 28x28 black-and-white images of handwritten digits extracted from two NIST databases. There are 60,000 images in the training set and 10,000 images in the test set, one class per digit for a total of 10 classes, with roughly 7,000 images (about 6,000 train images and 1,000 test images) per class. Half of the images were drawn by Census Bureau employees and the other half by high school students… See the full description on the dataset page: https://huggingface.co/datasets/ylecun/mnist.

📄 1 paper ⬇ 149.3k 💛 248 🤗 HF mit
NUS-WIDE [Canonical]
📄 1 paper ⬇ 23 🤗 HF
100,000 drug-like molecules [Emerging]
📄 1 paper
200 expert-validated queries [Emerging]
📄 1 paper
200-prompt benchmark [Emerging]
📄 1 paper
2026 SoccerNet VQA Challenge [Emerging]
📄 1 paper
256^2 DNS dataset [Emerging]
📄 1 paper
2D dynamic control [Emerging]
📄 1 paper
2WikiMultiHopQA [Emerging]
📄 1 paper
3DGS [Emerging]
📄 1 paper
3DGUT [Emerging]
📄 1 paper
40-dimensional Lorenz-96 testbed [Emerging]
📄 1 paper
47 reasoning-trap questions [Emerging]
📄 1 paper
4DP-QA [Emerging]
📄 1 paper
4DP-QA-Bench [Emerging]
📄 1 paper
50-task hotel expense benchmark [Emerging]
📄 1 paper
50-user synthetic corpus [Emerging]
📄 1 paper
540-image benchmark [Emerging]
📄 1 paper
61 anonymised exams [Emerging]
📄 1 paper
AASIST [Emerging]
📄 1 paper
ABMIL [Emerging]
📄 1 paper
ACCIDENT @ CVPR [Emerging]
📄 1 paper
ACI-IoT-2023 [Emerging]
📄 1 paper
ACM Multimedia AVI Challenge 2026 [Emerging]
📄 1 paper
ActivityNet-1.3 [Emerging]
📄 1 paper
ActivityNet-GZSL [Emerging]
📄 1 paper
ADNI [Emerging]
📄 1 paper
AIBL [Emerging]
📄 1 paper
ALFWorld [Emerging]
📄 1 paper
Allen-Cahn [Emerging]
📄 1 paper
AllShowers [Emerging]
📄 1 paper
Amazon Berkeley Objects [Emerging]
📄 1 paper
AntMaze [Emerging]
📄 1 paper
A-OKVQA [Emerging]
📄 1 paper
ARC-TGI [Emerging]
📄 1 paper
Artifact-Bench [Emerging]
📄 1 paper
ASTEER [Emerging]
📄 1 paper
ASVspoof5 [Emerging]
📄 1 paper
AutoPET-III [Emerging]
📄 1 paper
Banuyo [Emerging]
📄 1 paper
BashArena [Emerging]
📄 1 paper
BBBC021 [Emerging]
📄 1 paper
BDD100K [Emerging]
📄 1 paper
Bench2Dri [Emerging]
📄 1 paper
Berkeley Deep Drive Dataset (BDD100K) [Emerging]
📄 1 paper
Biogen [Emerging]
📄 1 paper
BraTS 2023 [Emerging]
📄 1 paper