Interpretable Failure Analysis In Multi-agent Reinforcement Learning Systems
2026 · Risal Shahriar Shefin, Debashis Gupta, Thai Le, et al.
Abstract
Multi-Agent Reinforcement Learning (MARL) is increasingly deployed in safety-critical domains, yet methods for interpretable failure detection and attribution remain underdeveloped. We introduce a two-stage gradient-based framework that provides interpretable diagnostics for three critical failure analysis tasks: (1) detecting the true initial failure source (Patient-0); (2) validating why non-attacked agents may be flagged first due to domino effects; and (3) tracing how failures propagate through learned coordination pathways. Stage 1 performs interpretable per-agent failure detection via Taylor-remainder analysis of policy-gradient costs, declaring an initial Patient-0 candidate at the first threshold crossing. Stage 2 provides validation through geometric analysis of critic derivatives, namely first-order sensitivity and directional second-order curvature aggregated over causal windows, to construct interpretable contagion graphs. This approach explains "downstream-first" detection anomalie
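To make the Stage 1 idea concrete, the following is a minimal sketch of a Taylor-remainder score with a first-threshold-crossing rule. It is not the authors' implementation: the quadratic `toy_cost`, the reference point `x_ref`, the perturbation schedule, and the threshold value are all assumptions chosen for illustration, since the abstract does not specify the policy-gradient cost or how the threshold is set.

```python
import numpy as np


def taylor_remainder(cost, grad, x_ref, x_new):
    """First-order Taylor remainder of `cost` around `x_ref`, evaluated at `x_new`.

    R = cost(x_new) - [cost(x_ref) + grad(x_ref) . (x_new - x_ref)]
    A large |R| means the linear approximation no longer explains the cost.
    """
    delta = x_new - x_ref
    return cost(x_new) - (cost(x_ref) + grad(x_ref) @ delta)


def first_crossing(scores, threshold):
    """Return (timestep, agent) of the earliest |score| above `threshold`.

    `scores` has shape (T, n_agents). Returns None if nothing crosses.
    Ties within one timestep resolve to the lowest agent index.
    """
    hits = np.argwhere(np.abs(scores) > threshold)  # row-major: earliest t first
    if hits.size == 0:
        return None
    t, agent = hits[0]
    return int(t), int(agent)


# Hypothetical stand-in for a per-agent cost: c(x) = 0.5 * ||x||^2, gradient x.
def toy_cost(x):
    return 0.5 * float(x @ x)


def toy_grad(x):
    return x


T, n_agents, dim = 5, 3, 2
x_ref = np.ones(dim)

# Assumed perturbation schedule: small drift everywhere, a large deviation on
# agent 1 from t=2 (the "attacked" agent) and on agent 2 from t=3 (downstream).
deltas = np.full((T, n_agents, dim), 0.1)
deltas[2:, 1, :] = 1.0
deltas[3:, 2, :] = 0.8

scores = np.array([
    [taylor_remainder(toy_cost, toy_grad, x_ref, x_ref + deltas[t, i])
     for i in range(n_agents)]
    for t in range(T)
])

patient_zero = first_crossing(scores, threshold=0.5)
print(patient_zero)
```

For this quadratic cost the remainder equals half the squared norm of the perturbation, so the score depends only on how far an agent has moved from the reference point. The first-crossing rule alone cannot distinguish an attacked agent from a downstream one that happens to cross earlier, which is the case the abstract's Stage 2 validation is meant to address.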
Related papers
- Vulnerable Agent Identification In Large-scale Multi-agent Reinforcement Learning (2025)
- Tackling Uncertainties In Multi-agent Reinforcement Learning Through Integration Of Agent Termination Dynamics (2025)
- Multi-agent Reinforcement Learning In Stochastic Networked Systems (2020)
- Fault Tolerant Multi-agent Learning With Adversarial Budget Constraints (2025)
- Understanding Individual Decision-making In Multi-agent Reinforcement Learning: A Dynamical Systems Approach (2025)
- A Principle Of Targeted Intervention For Multi-agent Reinforcement Learning (2025)
- Collaborative Adaptation For Recovery From Unforeseen Malfunctions In Discrete And Continuous MARL Domains (2024)
- Safe Multi-agent Reinforcement Learning With Convergence To Generalized Nash Equilibrium (2024)