The Uncertainty Bellman Equation And Exploration
2017 · Brendan O'Donoghue, Ian Osband, Remi Munos, et al.
Abstract
We consider the exploration/exploitation problem in reinforcement learning. For exploitation, it is well known that the Bellman equation connects the value at any time-step to the expected value at subsequent time-steps. In this paper we consider a similar uncertainty Bellman equation (UBE), which connects the uncertainty at any time-step to the expected uncertainties at subsequent time-steps, thereby extending the potential exploratory benefit of a policy beyond individual time-steps. We prove that the unique fixed point of the UBE yields an upper bound on the variance of the posterior distribution of the Q-values induced by any policy. This bound can be much tighter than traditional count-based bonuses that compound standard deviation rather than variance. Importantly, and unlike several existing approaches to optimism, this method scales naturally to large systems with complex generalization. Substituting our UBE-exploration strategy for \(\epsilon\)-greedy improves DQN p
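The recursion the abstract describes, in which uncertainty at one time-step is tied to the expected uncertainty at the next, can be illustrated on a small tabular problem. The sketch below is not the paper's implementation: it assumes a finite-horizon recursion of the form u_h(s,a) = nu_h(s,a) + sum over s', a' of P_h(s'|s,a) pi_{h+1}(a'|s') u_{h+1}(s',a'), with zero uncertainty past the horizon. The function name `solve_ube`, the array layouts, and the toy numbers are all hypothetical choices made for illustration.

```python
import numpy as np


def solve_ube(nu, P, pi):
    """Backward recursion for an assumed tabular uncertainty Bellman equation.

    nu: (H, S, A) local uncertainty at each step, state, and action (assumed given)
    P:  (H, S, A, S) expected transition probabilities P[h, s, a, s']
    pi: (H, S, A) policy probabilities pi[h, s, a]
    Returns u with shape (H, S, A).
    """
    H, S, A = nu.shape
    u = np.zeros((H, S, A))
    next_u = np.zeros(S)  # uncertainty beyond the horizon is taken to be zero
    for h in range(H - 1, -1, -1):
        # Local uncertainty plus expected uncertainty at the next step.
        u[h] = nu[h] + P[h] @ next_u
        # Average over the policy's action choice for use at step h - 1.
        next_u = (pi[h] * u[h]).sum(axis=1)
    return u


# Toy problem (hypothetical numbers): 2 steps, 2 states, 1 action,
# unit local uncertainty everywhere, every transition leads to state 1.
nu = np.ones((2, 2, 1))
P = np.zeros((2, 2, 1, 2))
P[:, :, 0, 1] = 1.0
pi = np.ones((2, 2, 1))

u = solve_ube(nu, P, pi)
print(u[:, :, 0])
```

In this toy case the uncertainty at the first step is 2 and at the last step is 1, so the first-step value on a standard-deviation scale is sqrt(2), roughly 1.41. Summing the per-step standard deviations instead would give 2, which is one way to read the abstract's remark about compounding standard deviation rather than variance.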
Related papers
- Uncertainty Quantification And Exploration For Reinforcement Learning (2019) 6.77
- Model-based Epistemic Variance Of Values For Risk-aware Policy Optimization (2023) 0.00
- Temporal Difference Uncertainties As A Signal For Exploration (2020) 0.00
- Efficient Exploration With Double Uncertain Value Networks (2017) 0.00
- Smart Exploration In Reinforcement Learning Using Bounded Uncertainty Models (2025) 0.00
- Exploration Via Epistemic Value Estimation (2023) 2.26
- Learning Near Optimal Policies With Low Inherent Bellman Error (2020) 0.00
- Careful At Estimation And Bold At Exploration (2023) 0.00