Datasets - Awesome Cybersecurity
271 datasets & benchmarks: 13 canonical foundations plus emerging datasets mined from recent papers. Each links to the papers that use it.
Source: https://www.kaggle.com/datasets/dhoogla/unswnb15?resource=download Dataset: This is an academic intrusion detection dataset. All credit goes to the original authors, Dr. Nour Moustafa and Dr. Jill Slay. Please cite their original paper and all other appropriate articles listed on the UNSW-NB15 page. The full dataset also offers the pcap, BRO and Argus files along with additional documentation. The modifications to the predesignated train-test sets are minimal… See the full description on the dataset page: https://huggingface.co/datasets/wwydmanski/UNSW-NB15.
We have developed a Python package as a wrapper around the Hugging Face Hub and the Hugging Face Datasets library to access this dataset easily. NIDS Datasets: The nids-datasets package provides functionality to download and utilize specially curated and extracted datasets from the original UNSW-NB15 and CIC-IDS2017 datasets. These datasets, which initially were only flow datasets, have been enhanced to include packet-level information from the raw PCAP files. The dataset contains both… See the full description on the dataset page: https://huggingface.co/datasets/rdpahalavan/CIC-IDS2017.
NSL-KDD: This dataset is the result of converting the ARFF file provided by the link into CSV. The author stores the data with values converted to float64. Additional original files are organized in the Original directory in the repo. Labels: The columns of the dataset are as follows.
#  Column         Non-Null Count   Dtype
0  duration       151165 non-null  int64
1  protocol_type  151165 non-null  object
2  service        151165 non-null…
See the full description on the dataset page: https://huggingface.co/datasets/Mireu-Lab/NSL-KDD.
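The ARFF-to-CSV conversion described for NSL-KDD can be sketched as follows. This is a minimal illustration using only the Python standard library, not the dataset author's actual script. The sample ARFF text, its values, and the helper name `arff_to_csv` are hypothetical, and the sketch handles only simple dense ARFF files. It does not perform the float64 casting mentioned above.

```python
import csv
import io

# Hypothetical three-row ARFF sample. The attribute names follow the column
# listing above, but the values are made up for illustration.
SAMPLE_ARFF = """\
@relation sample
@attribute 'duration' real
@attribute 'protocol_type' {tcp,udp,icmp}
@attribute 'service' {http,ftp,smtp}
@data
0,tcp,http
12,udp,ftp
3,icmp,smtp
"""


def arff_to_csv(arff_text: str) -> str:
    """Convert a simple, dense ARFF document to CSV text (hypothetical helper)."""
    columns = []
    rows = []
    in_data = False
    for raw in arff_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and ARFF comments
        lower = line.lower()
        if in_data:
            # Every line after @data is one comma-separated record.
            rows.append([value.strip() for value in line.split(",")])
        elif lower.startswith("@attribute"):
            # The second token is the attribute name, possibly quoted.
            columns.append(line.split()[1].strip("'\""))
        elif lower.startswith("@data"):
            in_data = True
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(columns)  # header row from the @attribute declarations
    writer.writerows(rows)
    return out.getvalue()


csv_text = arff_to_csv(SAMPLE_ARFF)
print(csv_text)
```

A real conversion would more likely rely on an established ARFF reader, since the format also allows quoted values containing commas and sparse rows, which this sketch ignores.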
Dataset Card for MNIST. Dataset Summary: The MNIST dataset consists of 70,000 28x28 black-and-white images of handwritten digits extracted from two NIST databases. There are 60,000 images in the training dataset and 10,000 images in the validation dataset, one class per digit so a total of 10 classes, with 7,000 images (6,000 train images and 1,000 test images) per class. Half of the images were drawn by Census Bureau employees and the other half by high school students… See the full description on the dataset page: https://huggingface.co/datasets/ylecun/mnist.
Dataset Card for MATH-500. This dataset contains a subset of 500 problems from the MATH benchmark that OpenAI created in their Let's Verify Step by Step paper. See their GitHub repo for the source file: https://github.com/openai/prm800k/tree/main?tab=readme-ov-file#math-splits
Dataset Summary: SWE-bench Verified is a subset of 500 samples from the SWE-bench test set, which have been human-validated for quality. SWE-bench is a dataset that tests systems' ability to solve GitHub issues automatically. See this post for more details on the human-validation process. The dataset collects 500 test Issue-Pull Request pairs from popular Python repositories. Evaluation is performed by unit test verification using post-PR behavior as the reference solution. The original… See the full description on the dataset page: https://huggingface.co/datasets/SWE-bench/SWE-bench_Verified.
Elliptic Bitcoin Dataset. Dataset Description: This is the Elliptic Bitcoin dataset used for illicit transaction detection in cryptocurrency networks. The dataset contains Bitcoin transaction data with labeled illicit and licit transactions. Dataset Structure: The dataset consists of three CSV files. elliptic_txs_features.csv: transaction features (166 features per transaction), comprising 94 local features (derived from transaction information) and 72 aggregated features… See the full description on the dataset page: https://huggingface.co/datasets/yhoma/elliptic-bitcoin-dataset.
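The feature breakdown quoted for elliptic_txs_features.csv (94 local plus 72 aggregated, 166 in total) can be checked and applied with a short sketch. The column ordering used here, local features first and aggregated features after, is an assumption that the description above does not confirm. The helper name `split_features` is hypothetical, and any identifier column such as a transaction ID is assumed to have been removed beforehand.

```python
# Counts taken from the dataset description above.
N_LOCAL = 94
N_AGGREGATED = 72
N_FEATURES = 166


def split_features(row):
    """Split one 166-value feature row into (local, aggregated) parts.

    Assumes local features come first; this ordering is an assumption.
    """
    if len(row) != N_FEATURES:
        raise ValueError(f"expected {N_FEATURES} values, got {len(row)}")
    return row[:N_LOCAL], row[N_LOCAL:]


# Dummy feature row with placeholder values, for illustration only.
row = [float(i) for i in range(N_FEATURES)]
local, aggregated = split_features(row)
print(len(local), len(aggregated))
```

Keeping the counts as named constants makes the length check fail loudly if a file with a different column layout is loaded.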
Recipe and flavor pairings dataset(s) to be used for LLM training.