In Hindsight: A Smooth Reward For Steady Exploration
2019 · Hadi S. Jomaa, Josif Grabocka, Lars Schmidt-Thieme
Abstract
In classical Q-learning, the objective is to maximize the sum of discounted rewards by iteratively applying the Bellman equation as an update, in an attempt to estimate the action value function of the optimal policy. Conventionally, the loss function is defined as the temporal difference between the action value and the expected (discounted) reward; however, it focuses solely on the future, leading to overestimation errors. We extend the well-established Q-learning techniques by introducing the hindsight factor, an additional loss term that takes into account how the model progresses by integrating the historic temporal difference as part of the reward. The effect of this modification is examined in a deterministic continuous-state space function estimation problem, where the overestimation phenomenon is significantly reduced, resulting in improved stability. The underlying effect of the hindsight factor is modeled as an adaptive learning rate, which unlike existing adaptive optimi
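To make the conventional loss mentioned in the abstract concrete, the following is a minimal sketch of the standard Q-learning temporal-difference update that the hindsight factor extends. It is not the authors' method: the tabular setting, the function names `td_error` and `q_update`, and the values of `alpha` and `gamma` are assumptions chosen for illustration, and the hindsight term itself is omitted because the abstract does not give its form.

```python
import numpy as np


def td_error(q, s, a, r, s_next, gamma=0.99, done=False):
    """Standard Q-learning temporal-difference error for one transition.

    q is a (num_states, num_actions) table (an illustrative assumption;
    the paper works with a continuous state space).
    """
    # Bellman target: immediate reward plus discounted best next-state value.
    target = r if done else r + gamma * np.max(q[s_next])
    return target - q[s, a]


def q_update(q, s, a, r, s_next, alpha=0.1, gamma=0.99, done=False):
    """Apply one in-place Q-learning step and return the TD error used."""
    delta = td_error(q, s, a, r, s_next, gamma, done)
    q[s, a] += alpha * delta
    return delta


# Toy usage: 3 states, 2 actions, all values initialised to zero.
q = np.zeros((3, 2))
first_delta = q_update(q, s=0, a=1, r=1.0, s_next=2)
second_delta = q_update(q, s=0, a=1, r=1.0, s_next=2)
```

The `np.max` over next-state values is the step that looks only at the future; the abstract attributes overestimation errors to this forward-looking loss.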
Related papers
- Hindsight Policy Gradients (2017)
- Hindsight Value Function For Variance Reduction In Stochastic Dynamic Environment (2021)
- Seizing Serendipity: Exploiting The Value Of Past Success In Off-policy Actor-critic (2023)
- Learning Successor States And Goal-dependent Values: A Mathematical Viewpoint (2021)
- An Information-theoretic Optimality Principle For Deep Reinforcement Learning (2017)
- Exploration Versus Exploitation In Reinforcement Learning: A Stochastic Control Approach (2018)
- Hindsight Priors For Reward Learning From Human Preferences (2024)
- Exploration-exploitation In Multi-agent Competition: Convergence With Bounded Rationality (2021)