Interpretable Learning Dynamics In Unsupervised Reinforcement Learning
2025 · Shashwat Pandey
Abstract
We present an interpretability framework for unsupervised reinforcement learning (URL) agents, aimed at understanding how intrinsic motivation shapes attention, behavior, and representation learning. We analyze five agents (DQN, RND, ICM, PPO, and a Transformer-RND variant) trained on procedurally generated environments, using Grad-CAM, Layer-wise Relevance Propagation (LRP), exploration metrics, and latent space clustering. To capture how agents perceive and adapt over time, we introduce two metrics: attention diversity, which measures the spatial breadth of focus, and attention change rate, which quantifies temporal shifts in attention. Our findings show that curiosity-driven agents display broader, more dynamic attention and exploratory behavior than their extrinsically motivated counterparts. Among them, Transformer-RND combines wide attention, high exploration coverage, and compact, structured latent representations. Our results highlight the influence of architectural inductive bias
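The abstract names the two attention metrics but does not give their formulas. The sketch below is one plausible reading, not the authors' definition: it assumes attention diversity is the normalized Shannon entropy of a saliency map (for example a Grad-CAM heatmap) and that attention change rate is the mean total-variation distance between consecutive maps. The function names `normalize`, `attention_diversity`, and `attention_change_rate` are hypothetical.

```python
import math

def normalize(attn):
    """Flatten a 2D saliency map into a probability distribution.

    Negative values are clipped to zero. An all-zero map is treated
    as uniform (an assumption of this sketch).
    """
    flat = [max(float(v), 0.0) for row in attn for v in row]
    total = sum(flat)
    if total == 0:
        return [1.0 / len(flat)] * len(flat)
    return [v / total for v in flat]

def attention_diversity(attn):
    """Assumed definition: Shannon entropy divided by its maximum.

    Returns a value in [0, 1]: 0 when all attention sits on one cell,
    1 when attention is spread uniformly over the map.
    """
    p = normalize(attn)
    if len(p) < 2:
        return 0.0
    entropy = -sum(q * math.log(q) for q in p if q > 0)
    return entropy / math.log(len(p))

def attention_change_rate(maps):
    """Assumed definition: mean total-variation distance between
    consecutive normalized maps, a value in [0, 1]."""
    if len(maps) < 2:
        return 0.0
    dists = []
    for prev, curr in zip(maps, maps[1:]):
        p, q = normalize(prev), normalize(curr)
        dists.append(0.5 * sum(abs(a - b) for a, b in zip(p, q)))
    return sum(dists) / len(dists)
```

Under these assumptions, an agent whose saliency stays concentrated on one region would score low on both metrics, while an agent whose focus is broad and moves between frames would score high on both. Other choices (thresholded area for breadth, cosine distance for change) would be equally consistent with the abstract's wording.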
Related papers
- Can You See How I Learn? Human Observers' Inferences About Reinforcement Learning Agents' Learning Processes (2025) 0.00
- REVEAL-IT: Reinforcement Learning With Visibility Of Evolving Agent Policy For Interpretability (2024) 0.00
- Perspectives For Direct Interpretability In Multi-agent Deep Reinforcement Learning (2025) 0.00
- Interpretable By Design: Query-specific Neural Modules For Explainable Reinforcement Learning (2025) 0.00
- Why The Agent Made That Decision: Contrastive Explanation Learning For Reinforcement Learning (2024) 0.00
- A Survey On Interpretable Reinforcement Learning (2021) 0.00
- Interestingness Elements For Explainable Reinforcement Learning: Understanding Agents' Capabilities And Limitations (2019) 14.55
- Interpretability For Conditional Coordinated Behavior In Multi-agent Reinforcement Learning (2023) 3.58