Perspectives For Direct Interpretability In Multi-agent Deep Reinforcement Learning
2025 · Yoann Poupart, Aurélie Beynier, Nicolas Maudet
Abstract
Multi-Agent Deep Reinforcement Learning (MADRL) has proven effective at solving complex problems in robotics or games, yet most trained models are hard to interpret. While learning intrinsically interpretable models remains a prominent approach, its scalability and flexibility are limited when handling complex tasks or multi-agent dynamics. This paper advocates for direct interpretability, which generates post hoc explanations directly from trained models, as a versatile and scalable alternative that offers insights into agents' behaviour, emergent phenomena, and biases without altering the models' architectures. We explore modern methods, including relevance backpropagation, knowledge editing, model steering, activation patching, sparse autoencoders, and circuit discovery, to highlight their applicability to single-agent, multi-agent, and training-process challenges. By addressing MADRL interpretability, we propose directions aiming to advance active topics such as team identification, swarm
Related papers
- Deep Reinforcement Learning For Multi-agent Systems: A Review Of Challenges, Solutions And Applications (2018) · 22.57
- Deep Multiagent Reinforcement Learning: Challenges And Directions (2021) · 0.00
- An Organizationally-oriented Approach To Enhancing Explainability And Control In Multi-agent Reinforcement Learning (2025) · 2.26
- Joint Intrinsic Motivation For Coordinated Exploration In Multi-agent Deep Reinforcement Learning (2024) · 0.00
- Interpretable Learning Dynamics In Unsupervised Reinforcement Learning (2025) · 0.00
- Multi-agent Deep Reinforcement Learning (MADRL) Meets Multi-user MIMO Systems (2021) · 7.50
- A Survey Of Multi-agent Deep Reinforcement Learning With Communication (2022) · 0.00
- A Survey And Critique Of Multiagent Deep Reinforcement Learning (2018) · 20.07