← all datasets

DeepRed

Emerging
1 paper using it
First seen: 2026

DeepRed is an open-source benchmark that evaluates large language model (LLM) agents on realistic Capture the Flag (CTF) challenges run in isolated virtualized environments. It provides full execution traces and a partial-credit scoring method based on challenge-specific checkpoints.

Papers using DeepRed (1)

DeepRed | datasets | cybersecurity