Off-policy Actor-critic With Emphatic Weightings
2021 · Eric Graves, Ehsan Imani, Raksha Kumaraswamy, et al.
Abstract
A variety of theoretically sound policy gradient algorithms exist for the on-policy setting due to the policy gradient theorem, which provides a simplified form for the gradient. The off-policy setting, however, has been less clear due to the existence of multiple objectives and the lack of an explicit off-policy policy gradient theorem. In this work, we unify these objectives into one off-policy objective, and provide a policy gradient theorem for this unified objective. The derivation involves emphatic weightings and interest functions. We show multiple strategies to approximate the gradients, in an algorithm called Actor Critic with Emphatic weightings (ACE). We prove in a counterexample that previous (semi-gradient) off-policy actor-critic methods, particularly Off-Policy Actor-Critic (OffPAC) and Deterministic Policy Gradient (DPG), converge to the wrong solution whereas ACE finds the optimal solution. We also highlight why these semi-gradient approaches can still perform well in
Related papers
- An Off-policy Policy Gradient Theorem Using Emphatic Weightings (2018)
- Compatible Gradient Approximations For Actor-critic Algorithms (2024)
- Convergent Actor-critic Algorithms Under Off-policy Training And Function Approximation (2018)
- Doubly Robust Off-policy Actor-critic Algorithms For Reinforcement Learning (2019)
- Multi-preference Actor Critic (2019)
- Beyond The Policy Gradient Theorem For Efficient Policy Updates In Actor-critic Algorithms (2022)
- Greedy Actor-critic: A New Conditional Cross-entropy Method For Policy Improvement (2018)
- Provably Convergent Two-timescale Off-policy Actor-critic With Function Approximation (2019)