Temporal-difference Value Estimation Via Uncertainty-guided Soft Updates
2021 · Litian Liang, Yaosheng Xu, Stephen McAleer, et al.
Abstract
Temporal-Difference (TD) learning methods, such as Q-Learning, have proven effective at learning a policy to perform control tasks. One issue with methods like Q-Learning is that the value update introduces bias when predicting the TD target of an unfamiliar state. Estimation noise becomes a bias after the max operator in the policy improvement step and carries over to the value estimates of other states, causing Q-Learning to overestimate the Q value. Algorithms like Soft Q-Learning (SQL) introduce the notion of a soft-greedy policy, which reduces the estimation bias via soft updates in the early stages of training. However, the inverse temperature \(\beta\) that controls the softness of an update is usually set by a hand-designed heuristic, which can fail to capture the uncertainty in the target estimate accurately. Under the belief that \(\beta\) is closely related to the (state dependent) model uncertainty, Entropy Regularized Q-Learning (EQL) further introduces a principled scheduling o
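The overestimation effect described in the abstract can be illustrated with a small numerical sketch. It assumes one common form of soft backup, the log-mean-exp of the Q values under a uniform action prior with inverse temperature \(\beta\); this is an illustrative choice and not necessarily the exact update used in the paper. The function name `soft_value` and the toy setup (five actions whose true value is zero, unit Gaussian estimation noise) are hypothetical.

```python
import numpy as np

def soft_value(q, beta):
    """Soft backup over the last axis: (1/beta) * log(mean(exp(beta * q))).

    Assumes a uniform prior over actions (an illustrative choice).
    beta = 0 gives the plain mean; large beta approaches the hard max.
    """
    q = np.asarray(q, dtype=float)
    if beta == 0.0:
        return q.mean(axis=-1)
    z = beta * q
    m = z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    return (m.squeeze(-1) + np.log(np.exp(z - m).mean(axis=-1))) / beta

rng = np.random.default_rng(0)
true_q = np.zeros(5)  # every action has true value 0
noisy_q = true_q + rng.normal(scale=1.0, size=(10_000, 5))  # noisy estimates

# Average backed-up value across trials; the true value is 0,
# so anything above 0 is bias introduced by the backup operator.
hard_bias = noisy_q.max(axis=1).mean()
soft_bias = soft_value(noisy_q, beta=0.5).mean()

print(f"hard max bias: {hard_bias:.3f}")
print(f"soft (beta=0.5) bias: {soft_bias:.3f}")
```

In this toy setting the hard max turns zero-mean noise into a clearly positive bias, while the softer backup stays closer to the true value. Raising `beta` moves the soft backup toward the hard max, which is why the choice of \(\beta\) matters when the estimates are still uncertain.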
Related papers
- Adaptive Temporal-difference Learning For Policy Evaluation With Per-state Uncertainty Estimates (2019) · 0.00
- Preferential Temporal Difference Learning (2021) · 0.00
- The Statistical Benefits Of Quantile Temporal-difference Learning For Value Estimation (2023) · 0.00
- Discerning Temporal Difference Learning (2023) · 0.00
- Pseudo-quantized Actor-critic Algorithm For Robustness To Noisy Temporal Difference Error (2026) · 0.00
- Assumed Density Filtering Q-learning (2017) · 5.24
- On The Statistical Benefits Of Temporal Difference Learning (2023) · 0.00
- Gradient Iterated Temporal-difference Learning (2026) · 0.00