Value Improved Actor Critic Algorithms
2024 · Yaniv Oren, Moritz A. Zanger, Pascal R. van Der Vaart, et al.
Abstract
To learn approximately optimal acting policies for decision problems, modern Actor Critic algorithms rely on deep neural networks (DNNs) to parameterize the acting policy and on greedification operators to iteratively improve it. The reliance on DNNs suggests a gradient-based improvement, which is, per step, much less greedy than the improvement possible with greedier operators such as the greedy update used by Q-learning algorithms. On the other hand, slow changes to the policy can also be beneficial for the stability of the learning process, resulting in a tradeoff between greedification and stability. To better address this tradeoff, we propose to decouple the acting policy from the policy evaluated by the critic. This allows the agent to separately improve the critic's policy (e.g. value improvement) with greedier updates while maintaining the slow gradient-based improvement to the parameterized acting policy. We investigate the convergence of this approach using the popular anal
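The decoupling described in the abstract can be illustrated with a minimal tabular sketch. It is not the paper's implementation. It assumes that the greedier operator applied to the critic's policy is a hard max over next-state action values (the Q-learning update the abstract cites as an example), and that the acting policy is a softmax over logits improved by one small gradient step. The function names `critic_target` and `actor_step` are hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def critic_target(q_next, pi_next, reward, gamma, value_improved):
    """Bootstrap target for a single transition.

    q_next:  critic estimates Q(s', .) for every action in the next state
    pi_next: acting policy probabilities pi(. | s')
    """
    if value_improved:
        # Assumption: the critic evaluates a greedier policy than the actor,
        # here the fully greedy one (a hard max, as in Q-learning).
        bootstrap = q_next.max()
    else:
        # Standard actor-critic: the critic evaluates the acting policy itself.
        bootstrap = pi_next @ q_next
    return reward + gamma * bootstrap


def actor_step(logits, q_values, lr):
    """One small gradient-ascent step on softmax logits for expected Q."""
    pi = softmax(logits)
    advantage = q_values - pi @ q_values
    # Gradient of sum_a pi(a) Q(a) with respect to logit b is pi(b) * advantage(b).
    return logits + lr * pi * advantage


q_next = np.array([1.0, 3.0])
pi_next = np.array([0.5, 0.5])
standard = critic_target(q_next, pi_next, reward=1.0, gamma=0.5, value_improved=False)
improved = critic_target(q_next, pi_next, reward=1.0, gamma=0.5, value_improved=True)
new_logits = actor_step(np.zeros(2), q_next, lr=0.1)
```

In this sketch the critic's target moves to the greedy value at once, while the acting policy's logits shift only by a small step scaled by `lr`, which is the greedification versus stability split the abstract describes.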
Related papers
- How To Learn A Useful Critic? Model-based Action-gradient-estimator Policy Optimization (2020)
- Beyond The Policy Gradient Theorem For Efficient Policy Updates In Actor-critic Algorithms (2022)
- Improving Actor-critic Training With Steerable Action-value Approximation Errors (2024)
- Mitigating Suboptimality Of Deterministic Policy Gradients In Complex Q-functions (2024)
- Neural Policy Gradient Methods: Global Optimality And Rates Of Convergence (2019)
- Greedy Actor-critic: A New Conditional Cross-entropy Method For Policy Improvement (2018)
- Neural Network Compatible Off-policy Natural Actor-critic Algorithm (2021)
- Recursive Least Squares Advantage Actor-critic Algorithms (2022)