Average-reward Reinforcement Learning With Trust Region Methods
2021 · Xiaoteng Ma, Xiaohang Tang, Li Xia, et al.
Abstract
Most reinforcement learning algorithms optimize the discounted criterion, which accelerates convergence and reduces the variance of estimates. Although the discounted criterion is appropriate for certain tasks such as finance-related problems, many engineering problems treat future rewards equally and prefer a long-run average criterion. In this paper, we study the reinforcement learning problem with the long-run average criterion. First, we develop a unified trust region theory for the discounted and average criteria and derive a novel performance bound within the trust region using Perturbation Analysis (PA) theory. Second, we propose a practical algorithm named Average Policy Optimization (APO), which improves value estimation with a novel technique named Average Value Constraint. Finally, experiments are conducted in the continuous control environment MuJoCo. In most tasks, APO performs better than the discounted PPO, which demonstrates the effectiveness…
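To make the distinction between the two criteria concrete, here is a minimal sketch that is not taken from the paper. It assumes a hypothetical two-state Markov chain under a fixed policy (the transition matrix `P`, the rewards `R`, and both function names are illustrative choices). It computes the long-run average reward from the stationary distribution, and the discounted values by fixed-point iteration. For an ergodic chain, the scaled discounted value `(1 - gamma) * V(s)` approaches the average reward as `gamma` tends to 1, while smaller discount factors leave a state-dependent gap.

```python
# Hypothetical two-state chain under a fixed policy (illustrative only).
P = [[0.9, 0.1],   # transition probabilities from state 0
     [0.5, 0.5]]   # transition probabilities from state 1
R = [1.0, 0.0]     # expected one-step reward in each state


def average_reward(P, r, steps=1000):
    """Long-run average reward, using power iteration on the state distribution."""
    n = len(r)
    d = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(steps):
        # One step of the chain: d_next[t] = sum_s d[s] * P[s][t]
        d = [sum(d[s] * P[s][t] for s in range(n)) for t in range(n)]
    return sum(d[s] * r[s] for s in range(n))


def discounted_values(P, r, gamma, tol=1e-9, max_iters=1_000_000):
    """Discounted state values, iterating V <- r + gamma * P V until it stabilizes."""
    n = len(r)
    v = [0.0] * n
    for _ in range(max_iters):
        new_v = [r[s] + gamma * sum(P[s][t] * v[t] for t in range(n))
                 for s in range(n)]
        if max(abs(a - b) for a, b in zip(new_v, v)) < tol:
            return new_v
        v = new_v
    return v


if __name__ == "__main__":
    rho = average_reward(P, R)
    print("average reward:", rho)
    for gamma in (0.9, 0.99, 0.999):
        scaled = [(1.0 - gamma) * v for v in discounted_values(P, R, gamma)]
        print("gamma =", gamma, "scaled discounted values:", scaled)
```

For this chain the stationary distribution is (5/6, 1/6), so the average reward is 5/6 regardless of the start state, whereas the scaled discounted values differ between the two states until `gamma` is close to 1.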
Related papers
- Performance Bounds For Policy-based Average Reward Reinforcement Learning Algorithms (2023) 2.26
- ACPO: A Policy Optimization Algorithm For Average MDPs With Constraints (2023) 0.00
- Value Enhancement Of Reinforcement Learning Via Efficient And Robust Trust Region Optimization (2023) 0.00
- Absolute Policy Optimization (2023) 0.00
- Adaptive Trust Region Policy Optimization: Global Convergence And Faster Rates For Regularized MDPs (2019) 12.10
- Examining Average And Discounted Reward Optimality Criteria In Reinforcement Learning (2021) 0.00
- Trust-PCL: An Off-policy Trust Region Method For Continuous Control (2017) 0.00
- Trust Region Policy Optimisation In Multi-agent Reinforcement Learning (2021) 0.00