A Theoretical Framework For Explaining Reinforcement Learning With Shapley Values
2025 · Daniel Beechey, Thomas M. S. Smith, Özgür Şimşek
Abstract
Reinforcement learning agents can achieve super-human performance in complex decision-making tasks, but their behaviour is often difficult to understand and explain. This lack of explanation limits deployment, especially in safety-critical settings where understanding and trust are essential. We identify three core explanatory targets that together provide a comprehensive view of reinforcement learning agents: behaviour, outcomes, and predictions. We develop a unified theoretical framework for explaining these three elements of reinforcement learning agents through the influence of individual features that the agent observes in its environment. We derive feature influences by using Shapley values, which collectively and uniquely satisfy a set of well-motivated axioms for fair and consistent credit assignment. The proposed approach, Shapley Values for Explaining Reinforcement Learning (SVERL), provides a single theoretical framework to comprehensively and meaningfully explain reinforcem
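The abstract relies on Shapley values to assign credit to individual features. As a rough illustration of that general idea only, the sketch below computes exact Shapley values for a small set function. It is not the SVERL method: the function `shapley_values`, the feature names, and the toy characteristic function `toy_value` are all hypothetical and chosen for illustration, and how the paper defines the value of a feature coalition for behaviour, outcomes, and predictions is not shown here.

```python
from itertools import combinations
from math import factorial


def shapley_values(features, value):
    """Exact Shapley values of `value`, a function from a frozenset of features to a number.

    Enumerates every coalition, so it is only practical for a handful of features.
    """
    n = len(features)
    phi = {}
    for i in features:
        others = [f for f in features if f != i]
        total = 0.0
        for size in range(n):
            # Standard Shapley weight |S|! (n - |S| - 1)! / n! for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                s = frozenset(coalition)
                # Marginal contribution of feature i when added to coalition s.
                total += weight * (value(s | {i}) - value(s))
        phi[i] = total
    return phi


def toy_value(coalition):
    """Hypothetical characteristic function: x and y contribute, with an interaction; z is irrelevant."""
    v = 0.0
    if "x" in coalition:
        v += 1.0
    if "y" in coalition:
        v += 2.0
    if "x" in coalition and "y" in coalition:
        v += 1.0  # extra value only when both features are known
    return v


phi = shapley_values(["x", "y", "z"], toy_value)
```

In this toy game the interaction term is split evenly between `x` and `y`, the irrelevant feature `z` receives zero, and the values sum to the difference between the full coalition and the empty one, which illustrates the kind of axioms (symmetry, null player, efficiency) the abstract alludes to.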
Related papers
- Explaining Reinforcement Learning With Shapley Values (2023) · 0.00
- Collective Explainable AI: Explaining Cooperative Strategies And Agent Contribution In Multiagent Reinforcement Learning With Shapley Values (2021) · 0.00
- Explaining Reinforcement Learning: A Counterfactual Shapley Values Approach (2024) · 0.00
- SHAQ: Incorporating Shapley Value Theory Into Multi-agent Q-learning (2021) · 0.00
- From Explainability To Interpretability: Interpretable Policies In Reinforcement Learning Via Model Explanation (2025) · 0.00
- The Shapley Value In Machine Learning (2022) · 17.35
- Experiential Explanations For Reinforcement Learning (2022) · 2.26
- Explainability In Deep Reinforcement Learning, A Review Into Current Methods And Applications (2022) · 12.33