DeepRed

Emerging

1 paper using it

First seen: 2026

DeepRed is an open-source benchmark that evaluates Large Language Model (LLM) agents on realistic Capture The Flag (CTF) challenges in isolated virtualized environments, providing full execution traces and a partial-credit scoring method based on challenge-specific checkpoints.
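As a rough illustration of what checkpoint-based partial credit can look like, the sketch below scores a run as the weighted fraction of checkpoints the agent reached. It is not DeepRed's actual scoring code: the `Checkpoint` class, the `partial_credit` function, the checkpoint names, and the weights are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass(frozen=True)
class Checkpoint:
    """One challenge-specific milestone (hypothetical structure, not DeepRed's schema)."""
    name: str
    weight: float = 1.0


def partial_credit(checkpoints: Iterable[Checkpoint], reached: Set[str]) -> float:
    """Return the weighted fraction of checkpoints reached, in [0, 1].

    Assumption: credit is a simple weighted ratio. The benchmark's real
    scoring rule may differ.
    """
    checkpoints = list(checkpoints)
    total = sum(c.weight for c in checkpoints)
    if total <= 0:
        raise ValueError("checkpoints must have positive total weight")
    earned = sum(c.weight for c in checkpoints if c.name in reached)
    return earned / total


# Hypothetical checkpoints for a single CTF challenge.
CHECKPOINTS = [
    Checkpoint("recon"),
    Checkpoint("foothold"),
    Checkpoint("privilege_escalation"),
    Checkpoint("flag", weight=2.0),
]

# An agent that completed reconnaissance and gained a foothold,
# but never escalated privileges or captured the flag.
score = partial_credit(CHECKPOINTS, {"recon", "foothold"})
print(f"partial credit: {score:.2f}")
```

Under this assumed rule, a run that never captures the flag still earns credit for intermediate progress, which is the property that distinguishes partial-credit scoring from a pass/fail flag check.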


Papers using DeepRed (1)

Do Agents Dream Of Root Shells? Partial-credit Evaluation Of LLM Agents In Capture The Flag Challenges (2026)