Hindsight Trust Region Policy Optimization
2019 · Hanbo Zhang, Site Bai, Xuguang Lan, et al.
Abstract
Reinforcement learning (RL) with sparse rewards is a major challenge. We propose *Hindsight Trust Region Policy Optimization* (HTRPO), a new RL algorithm that extends the highly successful TRPO algorithm with *hindsight* to tackle the challenge of sparse rewards. Hindsight refers to the algorithm's ability to learn from information across goals, including ones not intended for the current task. HTRPO leverages two main ideas. It introduces QKL, a quadratic approximation to the KL divergence constraint on the trust region, leading to reduced variance in KL divergence estimation and improved stability in policy updates. It also presents Hindsight Goal Filtering (HGF) to select conducive hindsight goals. In experiments, we evaluate HTRPO in various sparse reward tasks, including simple benchmarks, image-based Atari games, and simulated robot control. Ablation studies indicate that QKL and HGF contribute greatly to learning stability and high performance. Comparison results show that in all t
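To make the idea of a quadratic approximation to the KL divergence concrete, the sketch below compares two sample-based estimators on a toy one-dimensional Gaussian policy. It assumes the standard second-order form, half the mean squared difference of log-probabilities, which may differ in detail from the QKL definition in the paper. The function names `kl_estimates` and `gaussian_logpdf`, the policy means, and the sample size are all hypothetical choices made for illustration.

```python
import math
import random


def gaussian_logpdf(x, mean, std):
    """Log-density of a 1-D Gaussian (a stand-in for a policy's log-probability)."""
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std * math.sqrt(2.0 * math.pi))


def kl_estimates(logp_old, logp_new):
    """Return (naive, quadratic) sample estimates of KL(old || new).

    Both inputs are log-probabilities of actions sampled from the OLD policy.
    The quadratic form is an assumed second-order approximation, not
    necessarily the paper's exact QKL.
    """
    diffs = [lo - ln for lo, ln in zip(logp_old, logp_new)]
    n = len(diffs)
    naive = sum(diffs) / n                           # mean of log-ratios, can go negative
    quadratic = sum(0.5 * d * d for d in diffs) / n  # mean of half squared log-ratios
    return naive, quadratic


# Toy setup (assumption): old policy N(0, 1), new policy N(0.1, 1).
rng = random.Random(0)
actions = [rng.gauss(0.0, 1.0) for _ in range(20000)]
logp_old = [gaussian_logpdf(a, 0.0, 1.0) for a in actions]
logp_new = [gaussian_logpdf(a, 0.1, 1.0) for a in actions]

naive, quadratic = kl_estimates(logp_old, logp_new)
exact = 0.5 * 0.1 ** 2  # closed-form KL between two unit-variance Gaussians

print(f"exact={exact:.6f} naive={naive:.6f} quadratic={quadratic:.6f}")
```

Every term in the quadratic estimator is non-negative, so the estimate can never drop below zero, whereas the naive mean of log-ratios can be negative on a small batch. In this toy setting the two policies are close, which is the regime where a second-order approximation is expected to be accurate.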
Related papers
- Trust Region Policy Optimisation In Multi-agent Reinforcement Learning (2021)
- Adaptive Trust Region Policy Optimization: Global Convergence And Faster Rates For Regularized MDPs (2019)
- Embedding Safety Into RL: A New Take On Trust Region Methods (2024)
- ENTRPO: Trust Region Policy Optimization Method With Entropy Regularization (2021)
- Simple Policy Optimization (2024)
- Multi-agent Trust Region Policy Optimization (2020)
- Exploration-driven Policy Optimization In RLHF: Theoretical Insights On Efficient Data Utilization (2024)
- Trust-PCL: An Off-policy Trust Region Method For Continuous Control (2017)