EnTRPO: Trust Region Policy Optimization Method With Entropy Regularization
2021 · Sahar Roostaie, Mohammad Mehdi Ebadzadeh
Abstract
Trust Region Policy Optimization (TRPO) is a popular and empirically successful policy search algorithm in reinforcement learning (RL). It iteratively solves a surrogate problem that restricts consecutive policies to be close to each other. TRPO is an on-policy algorithm. On-policy methods bring many benefits, such as the ability to gauge each resulting policy. However, they typically discard all knowledge about the policies that existed before. In this work, we use a replay buffer to bring elements of the off-policy learning setting into TRPO. Entropy regularization is commonly used to improve policy optimization in reinforcement learning. It is thought to aid exploration and generalization by encouraging more random policy choices. We add an entropy regularization term to the advantage over $\pi$, accumulated over time steps, in TRPO. We call this update EnTRPO. Our experiments demonstrate that EnTRPO achieves better performance for controlling a Cart-Pole system compared with the original TRPO.
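The abstract does not give the exact form of the entropy term, so the following is only a minimal sketch of one plausible reading: an entropy bonus is added to each step's reward, the sum is discounted over time steps, and a value baseline is subtracted to form the advantage. The function names, the coefficient `alpha`, the discount `gamma`, and the choice of a discrete action distribution are all assumptions made for illustration, not details taken from the paper.

```python
import math


def categorical_entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_regularized_advantages(rewards, values, action_probs,
                                   gamma=0.99, alpha=0.01):
    """Hypothetical entropy-augmented advantage estimate.

    For each step t this returns
        sum_k gamma^k * (r_{t+k} + alpha * H(pi(.|s_{t+k}))) - V(s_t),
    which is an assumed form, not necessarily the paper's definition.
    """
    if not (len(rewards) == len(values) == len(action_probs)):
        raise ValueError("rewards, values and action_probs must align")

    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk the trajectory backwards so each step reuses the discounted tail.
    for t in reversed(range(len(rewards))):
        bonus = alpha * categorical_entropy(action_probs[t])
        running = rewards[t] + bonus + gamma * running
        advantages[t] = running - values[t]
    return advantages


if __name__ == "__main__":
    # Three Cart-Pole-like steps with a reward of 1 each and two actions.
    rewards = [1.0, 1.0, 1.0]
    values = [2.5, 1.8, 0.9]
    action_probs = [[0.5, 0.5], [0.7, 0.3], [0.9, 0.1]]
    print(entropy_regularized_advantages(rewards, values, action_probs))
```

With `alpha=0` the estimate reduces to the ordinary discounted return minus the baseline, so the entropy term can be switched off to compare against a plain TRPO-style advantage.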