Unveiling The Black Box: A Multi-layer Framework For Explaining Reinforcement Learning-based Cyber Agents
2025 · Diksha Goel, Kristen Moore, Jeff Wang, et al.
Abstract
Reinforcement Learning (RL) agents are increasingly used to simulate sophisticated cyberattacks, but their decision-making processes remain opaque, hindering trust, debugging, and defensive preparedness. In high-stakes cybersecurity contexts, explainability is essential for understanding how adversarial strategies are formed and evolve over time. In this paper, we propose a unified, multi-layer explainability framework for RL-based attacker agents that reveals both strategic (Markov Decision Process (MDP)-level) and tactical (policy-level) reasoning. At the MDP-level, we model cyberattacks as a Partially Observable Markov Decision Process (POMDP) to expose exploration-exploitation dynamics and phase-aware behavioural shifts. At the policy-level, we analyse the temporal evolution of Q-values and use Prioritised Experience Replay (PER) to surface critical learning transitions and evolving action preferences. Evaluated across CyberBattleSim environments of increasing complexity, our frame
Related papers
- Constrained Black-box Attacks Against Cooperative Multi-agent Reinforcement Learning (2025)
- MAGIC-MASK: Multi-agent Guided Inter-agent Collaboration With Mask-based Explainability For Reinforcement Learning (2025)
- Learning To Cope With Adversarial Attacks (2019)
- A Framework For Adversarial Analysis Of Decision Support Systems Prior To Deployment (2025)
- A Survey On Explainable Reinforcement Learning: Concepts, Algorithms, Challenges (2022)
- Why The Agent Made That Decision: Contrastive Explanation Learning For Reinforcement Learning (2024)
- Beyond Rewards In Reinforcement Learning For Cyber Defence (2026)
- SUB-PLAY: Adversarial Policies Against Partially Observed Multi-agent Reinforcement Learning Systems (2024)