📊 Datasets — Awesome Similarity Search

444 datasets & benchmarks — 14 canonical foundations plus emerging datasets mined from recent papers. Each links to the papers that use it.

CIFAR-10Canonical

60,000 32×32 color images in 10 classes — a small, standard image-classification benchmark.

📄 31 papers

NUS-WIDECanonical

The NUS-WIDE dataset is a large-scale benchmark that contains images and their associated tags, used to evaluate cross-modal retrieval methods in multimedia applications.

📄 28 papers

ImageNetCanonical

~1.28M labeled images across 1,000 categories (ILSVRC) — the standard large-scale image-classification benchmark.

📄 27 papers

BEIRCanonical

A heterogeneous benchmark of 18 information-retrieval datasets for zero-shot evaluation of retrieval models.

📄 17 papers

MS MARCOCanonical

A large-scale passage-ranking and question-answering dataset built from real Bing search queries.

MNIST

70,000 28×28 grayscale images of handwritten digits (0–9), the classic image-classification benchmark.

📄 12 papers

Recipe-1MEmerging

The 'Recipe-1M' dataset is a large-scale benchmark containing one million recipes paired with corresponding images, used to evaluate cross-modal image-recipe retrieval methods.

📄 7 papers

Stanford Online ProductEmerging

The Stanford Online Product dataset is a benchmark used to evaluate image retrieval performance, containing images of products from various categories.

📄 7 papers

BigANNCanonical

'BigANN' is a benchmark that contains three large-scale public datasets of up to one billion visual descriptors, used to evaluate approximate nearest neighbor search methods.

📄 6 papers

CIFAR-100Emerging

CIFAR-100 is a dataset containing 60,000 32×32 color images across 100 classes, used to evaluate image classification and retrieval performance in machine learning models.

SIFT1M

SIFT1M data, copied from http://corpus-texmex.irisa.fr/. Published in: Jégou H, Douze M, Schmid C. Improving bag-of-features for large scale image search. International Journal of Computer Vision, 2010, 87(3): 316-336.

📄 6 papers

SketchyEmerging

The 'Sketchy' dataset is a benchmark that contains hand-drawn sketches paired with natural images, used to evaluate low-shot sketch-based image retrieval tasks.

📄 6 papers

TU BerlinEmerging

The TU Berlin dataset is a large-scale collection of hand-drawn sketches curated by the research team at TU Berlin and used for training and evaluating sketch classification models. It includes 20,000 unique sketches across 250 object categories, contributed by… See the full description on the dataset page: https://huggingface.co/datasets/sdiaeyu6n/tu-berlin.

📄 6 papers

CUB-200-2011Emerging

The CUB-200-2011 dataset is a benchmark that contains images of 200 bird species and is used to evaluate content-based image retrieval methods by assessing their ability to retrieve visually and semantically similar images.

📄 5 papers

MIRFlickr-25KEmerging

The 'MIRFlickr-25K' dataset contains 25,000 images and their associated textual descriptions, and it is used to evaluate cross-modal retrieval methods.

📄 5 papers

Natural QuestionsCanonical

Real Google search queries paired with Wikipedia pages and annotated with long and short answers, for open-domain QA.

📄 5 papers

Cars-196Emerging

The 'Cars-196' dataset is a benchmark that contains images of 196 different car models and is used to evaluate retrieval performance in distance metric learning.

📄 4 papers

COCOEmerging

The COCO dataset is a large-scale image dataset that contains images with annotations for object detection, segmentation, and captioning, and it is used to evaluate the performance of models in these tasks.

📄 4 papers

DeepFashionEmerging

DeepFashion is a dataset used for evaluating content-based image retrieval methods in the e-commerce domain, containing a diverse collection of fashion images.

📄 4 papers

In-ShopEmerging

The 'In-shop' dataset is a benchmark used to evaluate retrieval performance in distance metric learning, containing labeled data for assessing the effectiveness of models in identifying similar items within a shopping context.

📄 4 papers

LoTTEEmerging

LoTTE Passages Dataset for ColBERTv2

📄 4 papers

Market-1501Emerging

The Market-1501 dataset is a benchmark that contains a collection of images used to evaluate image retrieval performance, particularly in the context of person re-identification.

📄 4 papers

ROxfordEmerging

ROxford is a benchmark dataset used to evaluate image retrieval performance, specifically for particular object retrieval tasks.

📄 4 papers

BRIGHTEmerging

The 'BRIGHT' benchmark is used to evaluate the effectiveness of text embedding models in retrieval and listwise reranking tasks.

📄 3 papers

Deep-1MEmerging

The 'Deep-1M' dataset is a benchmark used to evaluate nearest-neighbor search accuracy in high-dimensional vector spaces.

📄 3 papers

FCVIDEmerging

The FCVID dataset is a benchmark used to evaluate mid-stream video-to-video retrieval performance, containing videos that allow for the assessment of retrieval methods under conditions where only the beginning part of a video is available as a query.

📄 3 papers

Flickr30kEmerging

The 'Flickr30k' dataset contains roughly 31,000 images, each paired with five descriptive captions, and is used to evaluate language-based image retrieval methods.

📄 3 papers

HolidaysEmerging

The 'Holidays' dataset is a benchmark used to evaluate image retrieval performance, containing a collection of images designed to assess visual similarity tasks.

📄 3 papers

LFWEmerging

Samples from the LFW dataset. Only users with more than one face image were selected, and their samples were partitioned into two directories, ingestion and recovery, to test a facial recognition system.

📄 3 papers

MIRFlickrEmerging

The 'MIRFlickr' dataset contains a collection of images and associated textual descriptions, and it is used to evaluate cross-modal retrieval techniques.

📄 3 papers

Movie-lens datasetEmerging

The Movie-lens dataset is a collection of user ratings for movies that is commonly used to evaluate collaborative filtering algorithms in recommendation systems.

📄 3 papers

OpenImagesEmerging

OpenImages is a large-scale dataset containing millions of labeled images used to evaluate active learning and search methods in computer vision.

📄 3 papers

Oxford-5kEmerging

The 'Oxford 5k' dataset is a benchmark of about five thousand photographs of Oxford landmarks, used to evaluate instance-level image retrieval methods.

📄 3 papers

Paris-6kEmerging

The 'Paris 6k' dataset is a large-scale landmark dataset used to evaluate instance-level image retrieval methods.

📄 3 papers

TriviaQAEmerging

TriviaQA is a benchmark dataset that contains a collection of trivia questions and their corresponding answers, used to evaluate the performance of open-domain question answering systems.

📄 3 papers

WikiEmerging

The 'Wiki' dataset is a benchmark that contains images and their corresponding textual descriptions, used to evaluate cross-modal retrieval techniques.

📄 3 papers

ActivityNetEmerging

ActivityNet is a benchmark dataset that contains a diverse collection of videos annotated with human activities, used to evaluate the performance of text-video retrieval methods.

📄 2 papers

ANN-BenchmarksEmerging

'ANN-Benchmarks' is a benchmarking framework used to evaluate the performance of approximate nearest neighbor search algorithms in vector databases.

📄 2 papers

arXiv.orgEmerging

arXiv.org is a repository that contains over 29 million mathematical expressions extracted from more than 900,000 scientific publications, and it is used to evaluate the effectiveness of machine learning approaches for retrieving relevant mathematical problem descriptions in research articles.

📄 2 papers

BUCC datasetEmerging

The BUCC dataset is a benchmark used for evaluating parallel corpus mining, containing aligned sentences across multiple languages.

📄 2 papers

CC_WEB_VIDEOEmerging

The CC_WEB_VIDEO dataset contains a collection of multimedia videos used to evaluate near-duplicate video retrieval (NDVR) methods.

📄 2 papers

Conceptual CaptionsEmerging

The 'Conceptual Captions' dataset contains image-caption pairs used to evaluate models on their ability to generate and understand natural language descriptions of images.

📄 2 papers

CUB-200Emerging

CUB-200 is a dataset that contains images of 200 bird species and is used to evaluate retrieval performance in distance metric learning.

📄 2 papers

CUHK-03Emerging

The 'CUHK-03' dataset is a benchmark that contains images of pedestrians captured from multiple camera views, used to evaluate person re-identification methods.

📄 2 papers

CVLEmerging

The 'CVL' dataset is a benchmark used to evaluate writer identification and writer retrieval in the document analysis and recognition field.

📄 2 papers

Deep1BCanonical

Deep1B data, copied from https://research.yandex.com/blog/benchmarks-for-billion-scale-similarity-search. Published in: Babenko A, Lempitsky V. Efficient indexing of billion-scale datasets of deep descriptors. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016: 2055-2063.

📄 2 papers

ESSEXEmerging

The 'ESSEX' dataset is a benchmark used to evaluate face retrieval methods.

📄 2 papers

Fashion-MNISTCanonical

A drop-in MNIST replacement with 70,000 grayscale images across 10 clothing categories.

📄 2 papers

FlickrEmerging

Flickr is a dataset that contains image-text pairs used to evaluate multimodal retrieval performance in vision-language tasks.

📄 2 papers

Flickr25kCanonical

Flickr25k is a dataset containing 25,000 images used to evaluate unsupervised hashing methods for efficient semantic retrieval and compact storage.

📄 2 papers

GloVeCanonical

Pre-trained vectors from GloVe (Global Vectors for Word Representation): the 50-dimensional embeddings from https://nlp.stanford.edu/projects/glove/.

📄 2 papers

IAPR TC-12Emerging

The IAPR TC-12 is a dataset that contains a collection of images and their associated textual descriptions, used to evaluate the effectiveness of cross-modal retrieval techniques between visual and textual data.

📄 2 papers

ImageNet-100Emerging

ImageNet-100 is a subset of the ImageNet dataset containing 100 classes used to evaluate the performance of image classification and retrieval methods.

📄 2 papers

IRMA 2009Emerging

The IRMA 2009 dataset contains 14,410 x-ray images categorized into 57 classes and is used to evaluate the performance of content-based medical image retrieval methods.

📄 2 papers

KITTIEmerging

The KITTI dataset is a benchmark that contains a variety of real-world driving scenarios, including RGB images and LiDAR point clouds, used to evaluate visual place recognition and related tasks in autonomous driving.

📄 2 papers

MEDLINEEmerging

MEDLINE is a large biomedical document collection that uses the Medical Subject Headings (MeSH) thesaurus as a controlled vocabulary, and it is used to evaluate methods for automatic semantic indexing and text categorization.

📄 2 papers

MLDoc datasetEmerging

The MLDoc dataset is a benchmark used to evaluate cross-lingual document classification across multiple languages.

📄 2 papers

MS MARCO PassagesEmerging

📄 2 papers
