📊 Datasets - Awesome Cybersecurity

271 datasets & benchmarks: 13 canonical foundations plus emerging datasets mined from recent papers. Each links to the papers that use it.

UNSW-NB15 (Canonical)

Source: https://www.kaggle.com/datasets/dhoogla/unswnb15?resource=download
Dataset: This is an academic intrusion detection dataset. All the credit goes to the original authors, Dr. Nour Moustafa and Dr. Jill Slay. Please cite their original paper and all other appropriate articles listed on the UNSW-NB15 page. The full dataset also offers the pcap, BRO and Argus files along with additional documentation. The modifications to the predesignated train-test sets are minimal… See the full description on the dataset page: https://huggingface.co/datasets/wwydmanski/UNSW-NB15.

📄 8 papers · ⬇ 244 · 💛 1 · 🤗 HF
CIFAR-100 (Emerging)
📄 7 papers
CICIDS2017 (Canonical)

We have developed a Python package as a wrapper around Hugging Face Hub and Hugging Face Datasets library to access this dataset easily. NIDS Datasets: The nids-datasets package provides functionality to download and utilize specially curated and extracted datasets from the original UNSW-NB15 and CIC-IDS2017 datasets. These datasets, which initially were only flow datasets, have been enhanced to include packet-level information from the raw PCAP files. The dataset contains both… See the full description on the dataset page: https://huggingface.co/datasets/rdpahalavan/CIC-IDS2017.

📄 5 papers · ⬇ 1.7k · 💛 4 · 🤗 HF · apache-2.0
CIFAR-10 (Emerging)
📄 5 papers · ⬇ 1.7k · 🤗 HF
NSL-KDD (Canonical)

NSL-KDD: This dataset converts the ARFF files provided by the link into CSV and stores the result. The data is stored after conversion to float64. If you want the original files, they are organized in the Original directory in the repo. Labels: the columns of the dataset are as follows.

  #  Column         Non-Null Count   Dtype
  0  duration       151165 non-null  int64
  1  protocol_type  151165 non-null  object
  2  service        151165 non-null…

See the full description on the dataset page: https://huggingface.co/datasets/Mireu-Lab/NSL-KDD.

📄 4 papers · ⬇ 184 · 💛 5 · 🤗 HF · gpl-3.0
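The ARFF-to-CSV conversion mentioned in the NSL-KDD description can be illustrated with a minimal sketch. This is not the dataset maintainer's script: the `arff_to_csv` helper and the three-column sample are hypothetical, and the parser assumes a dense ARFF file with one `@attribute` declaration per line and no quoted commas in the data section.

```python
import csv
import io


def arff_to_csv(arff_text: str) -> str:
    """Convert a simple dense ARFF document to CSV text (header row + data rows)."""
    columns = []
    rows = []
    in_data = False
    for raw in arff_text.splitlines():
        line = raw.strip()
        # Skip blank lines and ARFF comments.
        if not line or line.startswith("%"):
            continue
        lowered = line.lower()
        if in_data:
            rows.append([value.strip().strip("'") for value in line.split(",")])
        elif lowered.startswith("@attribute"):
            # "@attribute name type": keep only the attribute name.
            columns.append(line.split(None, 2)[1].strip("'"))
        elif lowered.startswith("@data"):
            in_data = True
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(columns)
    writer.writerows(rows)
    return out.getvalue()


# Hypothetical three-column sample; the values are illustrative only.
SAMPLE = """@relation KDDSample
@attribute 'duration' real
@attribute 'protocol_type' {'tcp','udp','icmp'}
@attribute 'service' {'http','ftp_data'}
@data
0,tcp,ftp_data
2,udp,http
"""

CSV_TEXT = arff_to_csv(SAMPLE)
print(CSV_TEXT)
```

Categorical columns such as protocol_type stay as strings here; any numeric casting would be a separate step after the CSV is written.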
ImageNet-100 (Emerging)
📄 3 papers · ⬇ 26 · 🤗 HF · mit
MNIST (Emerging)

Dataset Card for MNIST. Dataset Summary: The MNIST dataset consists of 70,000 28x28 black-and-white images of handwritten digits extracted from two NIST databases. There are 60,000 images in the training dataset and 10,000 images in the validation dataset, one class per digit so a total of 10 classes, with 7,000 images (6,000 train images and 1,000 test images) per class. Half of the images were drawn by Census Bureau employees and the other half by high school students… See the full description on the dataset page: https://huggingface.co/datasets/ylecun/mnist.

📄 2 papers · ⬇ 149.3k · 💛 248 · 🤗 HF · mit
MATH-500 (Emerging)

Dataset Card for MATH-500. This dataset contains a subset of 500 problems from the MATH benchmark that OpenAI created in their Let's Verify Step by Step paper. See their GitHub repo for the source file: https://github.com/openai/prm800k/tree/main?tab=readme-ov-file#math-splits

📄 2 papers · ⬇ 141.4k · 💛 316 · 🤗 HF
SWE-Bench-Verified (Emerging)

Dataset Summary: SWE-bench Verified is a subset of 500 samples from the SWE-bench test set, which have been human-validated for quality. SWE-bench is a dataset that tests systems' ability to solve GitHub issues automatically. See this post for more details on the human-validation process. The dataset collects 500 test Issue-Pull Request pairs from popular Python repositories. Evaluation is performed by unit test verification using post-PR behavior as the reference solution. The original… See the full description on the dataset page: https://huggingface.co/datasets/SWE-bench/SWE-bench_Verified.

📄 2 papers · ⬇ 69.6k · 💛 95 · 🤗 HF
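As a rough illustration of evaluation by unit-test verification, the sketch below marks an instance as resolved only when every required test passes after a candidate patch is applied. The `is_resolved` helper, the status strings, and the test names are hypothetical. The split into tests that must flip from failing to passing and tests that must keep passing is modeled on the dataset's `FAIL_TO_PASS` and `PASS_TO_PASS` fields, which is an assumption worth checking against the dataset card.

```python
from typing import Dict, Iterable


def is_resolved(
    test_results: Dict[str, str],
    fail_to_pass: Iterable[str],
    pass_to_pass: Iterable[str],
) -> bool:
    """Return True only if every required test passed after the patch."""
    required = list(fail_to_pass) + list(pass_to_pass)
    # A test that is missing from the results counts as not passed.
    return all(test_results.get(name) == "PASSED" for name in required)


# Hypothetical outcomes of running the test suite after applying a patch.
RESOLVED = is_resolved(
    {"test_fix": "PASSED", "test_existing": "PASSED"},
    fail_to_pass=["test_fix"],
    pass_to_pass=["test_existing"],
)
REGRESSED = is_resolved(
    {"test_fix": "PASSED", "test_existing": "FAILED"},
    fail_to_pass=["test_fix"],
    pass_to_pass=["test_existing"],
)
print(RESOLVED, REGRESSED)
```

Treating missing tests as failures is a conservative design choice: a patch that prevents the suite from running cannot be counted as a fix.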
Elliptic Bitcoin (Emerging)

Elliptic Bitcoin Dataset. Dataset Description: This is the Elliptic Bitcoin dataset used for illicit transaction detection in cryptocurrency networks. The dataset contains Bitcoin transaction data with labeled illicit and licit transactions. Dataset Structure: The dataset consists of three CSV files. elliptic_txs_features.csv holds the transaction features (166 features per transaction), comprising 94 local features (derived from transaction information) and 72 aggregated features… See the full description on the dataset page: https://huggingface.co/datasets/yhoma/elliptic-bitcoin-dataset.

📄 2 papers · ⬇ 347 · 🤗 HF · mit
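The 166 features per transaction in the Elliptic description break down as 94 local plus 72 aggregated features. A minimal sketch of splitting one feature row accordingly follows. The `split_features` helper is hypothetical, and the code assumes the local features come first in column order, which should be verified against the actual CSV.

```python
from typing import List, Sequence, Tuple

N_LOCAL = 94        # local features, per the dataset description
N_AGGREGATED = 72   # aggregated features, per the dataset description


def split_features(features: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Split one 166-value feature row into (local, aggregated) parts.

    Assumes local features precede aggregated ones in the row.
    """
    expected = N_LOCAL + N_AGGREGATED
    if len(features) != expected:
        raise ValueError(f"expected {expected} features, got {len(features)}")
    return list(features[:N_LOCAL]), list(features[N_LOCAL:])


# Synthetic row used only to exercise the helper.
row = [float(i) for i in range(166)]
local, aggregated = split_features(row)
print(len(local), len(aggregated))
```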
MalImg (Emerging)
📄 2 papers · ⬇ 78 · 🤗 HF
CICDDoS2019 (Emerging)
📄 2 papers · ⬇ 25 · 🤗 HF
CSE-CIC-IDS2018 (Canonical)
📄 2 papers · ⬇ 9 · 🤗 HF · ecl-2.0
EMBER (Canonical)

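EMBER (Endgame Malware BEnchmark for Research) is a benchmark dataset of features extracted from Windows portable executable (PE) files, used to train and evaluate static malware classifiers.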

📄 2 papers · ⬇ 4 · 🤗 HF · mit
AgentDojo (Emerging)
📄 2 papers
CICIDS 2023 (Emerging)
📄 2 papers
IEEE-CIS Fraud Detection benchmark (Emerging)
📄 2 papers
Llama-3.2-1B (Emerging)
📄 2 papers
PRISMA (Emerging)
📄 2 papers
Qwen2.5-0.5B (Emerging)
📄 2 papers
Qwen3-0.6B (Emerging)
📄 2 papers
Qwen3-4B (Emerging)
📄 2 papers
ToN-IoT (Canonical)
📄 2 papers
VirusTotal (Emerging)
📄 2 papers
Bot-IoT (Canonical)
📄 1 paper · ⬇ 8 · 🤗 HF
110-node, 181-edge instance (Emerging)
📄 1 paper
26B-A4B Mixture-of-Experts (Emerging)
📄 1 paper
2D airfoil (Emerging)
📄 1 paper
3D car (Emerging)
📄 1 paper
7,200 image dataset (Emerging)
📄 1 paper
8,100 force-closure grasps (Emerging)
📄 1 paper
81 objects (Emerging)
📄 1 paper
AACR Project GENIE Biopharma Collaborative dataset (Emerging)
📄 1 paper
Abuse.ch (Emerging)
📄 1 paper
AbuseIPDB (Emerging)
📄 1 paper
ACS Census (Emerging)
📄 1 paper
ADNI (Emerging)
📄 1 paper
AIME 2025 (Emerging)
📄 1 paper
AI-Pentest-Benchmark (Emerging)
📄 1 paper
ALFWorld (Emerging)
📄 1 paper
AlienVault OTX (Emerging)
📄 1 paper
AmBench (Emerging)
📄 1 paper
Android World (Emerging)
📄 1 paper
API call sequences (Emerging)
📄 1 paper
AppWorld (Emerging)
📄 1 paper
APT28 (Emerging)
📄 1 paper
APT29 (Emerging)
📄 1 paper
APT41 (Emerging)
📄 1 paper
APT44 (Emerging)
📄 1 paper
ARFBench (Emerging)
📄 1 paper
Argoverse 2 (Emerging)
📄 1 paper
ASB (Emerging)
📄 1 paper
ATRDF 2023 (Emerging)
📄 1 paper
AutoPenBench (Emerging)
📄 1 paper
AWS (Emerging)
📄 1 paper
Azure tenant (Emerging)
📄 1 paper
BaseJump STL (Emerging)
📄 1 paper
BashArena (Emerging)
📄 1 paper
BCI-IV-2a (Emerging)
📄 1 paper
BGL (Emerging)
📄 1 paper