Approximation Benefits Of Policy Gradient Methods With Aggregated States
2020 · Daniel Russo
Abstract
Folklore suggests that policy gradient can be more robust to misspecification than its relative, approximate policy iteration. This paper studies the case of state-aggregated representations, where the state space is partitioned and either the policy or the value function approximation is held constant over each partition. It shows that a policy gradient method converges to a policy whose per-period regret is bounded by \(\epsilon\), the largest difference between two elements of the state-action value function belonging to a common partition. With the same representation, both approximate policy iteration and approximate value iteration can produce policies whose per-period regret scales as \(\epsilon/(1-\gamma)\), where \(\gamma\) is the discount factor. Faced with inherent approximation error, methods that locally optimize the true decision objective can be far more robust.
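To make the two scalings concrete, here is a minimal Python sketch, not taken from the paper. It computes \(\epsilon\) for a toy state-action value table and a toy partition, then compares \(\epsilon\) with \(\epsilon/(1-\gamma)\). The function names `aggregation_gap` and `regret_scales`, the example values, and the choice to compare values of the same action across states in a partition are all assumptions made for illustration. The outputs are orders of magnitude only; they ignore any constants in the actual bounds.

```python
from collections import defaultdict


def aggregation_gap(q, partition):
    """Largest within-partition spread of state-action values.

    q         -- list of per-state lists, q[s][a] is the value of action a in state s
    partition -- list where partition[s] is the id of the cell containing state s

    Assumption: values are compared for the same action across states
    that share a cell.
    """
    cells = defaultdict(list)
    for state, cell in enumerate(partition):
        cells[cell].append(state)

    n_actions = len(q[0])
    gap = 0.0
    for states in cells.values():
        for a in range(n_actions):
            values = [q[s][a] for s in states]
            gap = max(gap, max(values) - min(values))
    return gap


def regret_scales(epsilon, gamma):
    """Return (epsilon, epsilon / (1 - gamma)): the two per-period regret scalings."""
    return epsilon, epsilon / (1.0 - gamma)


# Hypothetical toy problem: three states, two actions, states 0 and 1 aggregated.
q = [[1.0, 0.5], [1.2, 0.4], [3.0, 2.0]]
partition = [0, 0, 1]

eps = aggregation_gap(q, partition)
pg_scale, api_scale = regret_scales(eps, gamma=0.99)
print(f"epsilon = {eps:.3f}")
print(f"policy gradient scale = {pg_scale:.3f}, API/AVI scale = {api_scale:.1f}")
```

With a discount factor close to 1, the second quantity is larger than the first by a factor of \(1/(1-\gamma)\), which is the gap the abstract highlights.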
Related papers
- On The Theory Of Policy Gradient Methods: Optimality, Approximation, And Distribution Shift (2019)
- Convergent Actor-critic Algorithms Under Off-policy Training And Function Approximation (2018)
- Policy Gradient In Partially Observable Environments: Approximation And Convergence (2018)
- Off-policy Policy Gradient With State Distribution Correction (2019)
- A Temporal-difference Approach To Policy Gradient Estimation (2022)
- Compatible Gradient Approximations For Actor-critic Algorithms (2024)
- On The Convergence Of Discounted Policy Gradient Methods (2022)
- Adaptive Approximate Policy Iteration (2020)