DeepRed
Status: Emerging
Papers using it: 1
First seen: 2026
DeepRed is an open-source benchmark that evaluates Large Language Model (LLM) agents on realistic Capture The Flag (CTF) challenges in isolated virtualized environments, providing full execution traces and a partial-credit scoring method based on challenge-specific checkpoints.
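As a rough illustration of what checkpoint-based partial credit could look like, the sketch below scores a run as the weighted fraction of checkpoints the agent reached. This is an assumption made for illustration, not DeepRed's actual scoring formula: the `Checkpoint` class, the `partial_credit` function, the equal default weights, and the checkpoint names are all hypothetical and do not come from the benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    """One challenge-specific milestone (hypothetical structure)."""
    name: str
    weight: float = 1.0  # assumption: equal weights unless specified


def partial_credit(checkpoints: list[Checkpoint], reached: set[str]) -> float:
    """Return the weighted fraction of checkpoints reached, in [0.0, 1.0]."""
    total = sum(c.weight for c in checkpoints)
    if total == 0:
        # No checkpoints defined: nothing to earn credit for.
        return 0.0
    earned = sum(c.weight for c in checkpoints if c.name in reached)
    return earned / total


# Hypothetical challenge with four equally weighted checkpoints.
challenge = [
    Checkpoint("service_enumerated"),
    Checkpoint("vulnerability_identified"),
    Checkpoint("shell_obtained"),
    Checkpoint("flag_captured"),
]

# The agent reached the first two checkpoints but never got a shell or the flag.
score = partial_credit(challenge, {"service_enumerated", "vulnerability_identified"})
print(score)  # 0.5
```

Under this scheme, an agent that makes real progress without capturing the flag scores above zero, which is the property that separates partial-credit scoring from a binary solved/unsolved metric.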