Approximate Discounting-free Policy Evaluation From Transient And Recurrent States
2022 · Vektor Dewanto, Marcus Gallagher
Abstract
To distinguish policies that prescribe good actions from those that prescribe bad actions in transient states, we need to evaluate the so-called bias of a policy from transient states. However, we observe that most (if not all) existing work on approximate discounting-free policy evaluation estimates the bias solely from recurrent states. We therefore propose a system of approximators for the bias (specifically, its relative value) from both transient and recurrent states. Its key ingredient is a seminorm LSTD (least-squares temporal difference), whose minimizer we derive in a form that enables the sample-based approximation required in model-free reinforcement learning. This seminorm LSTD also facilitates the formulation of a general unifying procedure for LSTD-based policy value approximators. Experimental results validate the effectiveness of our proposed method.
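For orientation, the following is a minimal sketch of a generic average-reward LSTD estimate of the relative bias. It is not the seminorm LSTD proposed in the paper. It assumes tabular (one-hot) features, a known gain (average reward), and a hypothetical four-state chain in which state 3 is transient and states 0, 1, 2 form a recurrent cycle. The function name `lstd_relative_value`, the toy chain, and its rewards are all illustrative assumptions.

```python
import numpy as np

def lstd_relative_value(transitions, features, gain, ref_state=0):
    """Generic average-reward LSTD sketch (NOT the paper's seminorm LSTD).

    transitions: iterable of (state, reward, next_state) samples
    features:    array of shape (n_states, dim), one feature row per state
    gain:        average reward of the policy, assumed known here
    Returns the estimated bias of every state relative to ref_state.
    """
    dim = features.shape[1]
    A = np.zeros((dim, dim))
    b = np.zeros(dim)
    for s, r, s_next in transitions:
        phi, phi_next = features[s], features[s_next]
        A += np.outer(phi, phi - phi_next)  # sum of phi (phi - phi')^T
        b += phi * (r - gain)               # sum of phi (r - gain)
    # The bias is only defined up to an additive constant, so A can be
    # singular; lstsq returns the minimum-norm least-squares solution.
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    values = features @ w
    return values - values[ref_state]

# Hypothetical toy chain (an assumption for illustration):
# transient state 3 -> 0, then the recurrent cycle 0 -> 1 -> 2 -> 0.
cycle = [(0, 1.0, 1), (1, 0.0, 2), (2, 2.0, 0)]
transitions = [(3, 5.0, 0)] + cycle * 4
features = np.eye(4)   # tabular one-hot features
gain = 1.0             # mean reward over the recurrent cycle

rel = lstd_relative_value(transitions, features, gain)
print(rel)
```

In this toy setting the transient state is visited only once, which hints at why estimating the bias from transient states needs separate care: a single long trajectory supplies many samples for recurrent states but very few for transient ones.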
Related papers
- Adaptive Temporal-difference Learning For Policy Evaluation With Per-state Uncertainty Estimates (2019)
- A Temporal-difference Approach To Policy Gradient Estimation (2022)
- Parameter-free Reduction Of The Estimation Bias In Deep Reinforcement Learning For Deterministic Policy Gradients (2021)
- A Nearly Blackwell-optimal Policy Gradient Method (2021)
- Revisiting Estimation Bias In Policy Gradients For Deep Reinforcement Learning (2023)
- Accelerated And Instance-optimal Policy Evaluation With Linear Function Approximation (2021)
- Approximate Temporal Difference Learning Is A Gradient Descent For Reversible Policies (2018)
- Doubly Optimal Policy Evaluation For Reinforcement Learning (2024)