Explaining An Agent's Future Beliefs Through Temporally Decomposing Future Reward Estimators
2024 · Mark Towers, Yali Du, Christopher Freeman, et al.
Abstract
Future reward estimation is a core component of reinforcement learning agents; i.e., Q-value and state-value functions, predicting an agent's sum of future rewards. Their scalar output, however, obfuscates when or what individual future rewards an agent may expect to receive. We address this by modifying an agent's future reward estimator to predict their next N expected rewards, referred to as Temporal Reward Decomposition (TRD). This unlocks novel explanations of agent behaviour. Through TRD we can: estimate when an agent may expect to receive a reward, the value of the reward and the agent's confidence in receiving it; measure an input feature's temporal importance to the agent's action decisions; and predict the influence of different actions on future rewards. Furthermore, we show that DQN agents trained on Atari environments can be efficiently retrained to incorporate TRD with minimal impact on performance.
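To make the idea of predicting the next N expected rewards concrete, the following minimal sketch splits a single discounted return into N per-step terms plus one tail term, so that the terms sum back to the usual scalar value. It works on one sampled reward sequence rather than on learned expectations. The function name `decompose_return`, the choice to store each term already discounted, and the trailing tail term are assumptions made for illustration; they are not taken from the abstract.

```python
from typing import List


def decompose_return(rewards: List[float], gamma: float, n: int) -> List[float]:
    """Split a discounted return into n per-step terms plus one tail term.

    Hypothetical helper for illustration only; not the authors' implementation.
    """
    # First n entries: the discounted reward received at each of the next n steps.
    head = [gamma ** i * r for i, r in enumerate(rewards[:n])]
    # Pad with zeros if the episode ends before n steps have elapsed.
    head += [0.0] * (n - len(head))
    # Final entry: all rewards from step n onward, lumped into one discounted sum.
    tail = sum(gamma ** i * r for i, r in enumerate(rewards) if i >= n)
    return head + [tail]


rewards = [0.0, 0.0, 1.0, 0.0, 2.0]
parts = decompose_return(rewards, gamma=0.5, n=3)

# The ordinary scalar return that a Q-value or state-value target would use.
scalar_return = sum(0.5 ** i * r for i, r in enumerate(rewards))

print(parts)          # [0.0, 0.0, 0.25, 0.125]
print(scalar_return)  # 0.375
```

The scalar return alone (0.375) hides that the first non-zero reward arrives at the third step; the decomposed vector keeps that timing information while still summing to the same value.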
Related papers
- Foresee Then Evaluate: Decomposing Value Estimation With Latent Future Prediction (2021)
- Disentangling Dynamics And Returns: Value Function Decomposition With Future Prediction (2019)
- Explainable Reinforcement Learning Via Temporal Policy Decomposition (2025)
- Predicting Future Actions Of Reinforcement Learning Agents (2024)
- Learning Long-term Reward Redistribution Via Randomized Return Decomposition (2021)
- RUDDER: Return Decomposition For Delayed Rewards (2018)
- Distributional Reward Estimation For Effective Multi-agent Deep Reinforcement Learning (2022)
- Explaining Learned Reward Functions With Counterfactual Trajectories (2024)