Improving Policy Gradient By Exploring Under-appreciated Rewards
2016 · Ofir Nachum, Mohammad Norouzi, Dale Schuurmans
Abstract
This paper presents a novel form of policy gradient for model-free reinforcement learning (RL) with improved exploration properties. Current policy-based methods use entropy regularization to encourage undirected exploration of the reward landscape, which is ineffective in high dimensional spaces with sparse rewards. We propose a more directed exploration strategy that promotes exploration of under-appreciated reward regions. An action sequence is considered under-appreciated if its log-probability under the current policy under-estimates its resulting reward. The proposed exploration strategy is easy to implement, requiring small modifications to an implementation of the REINFORCE algorithm. We evaluate the approach on a set of algorithmic tasks that have long challenged RL methods. Our approach reduces hyper-parameter sensitivity and demonstrates significant improvements over baseline methods. Our algorithm successfully solves a benchmark multi-digit addition task and generalizes to
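The abstract's notion of an under-appreciated action sequence can be illustrated with a small sketch. The abstract gives no formula, so this is one plausible reading and not the authors' implementation: each sampled sequence gets the score reward / tau minus its log-probability, and the scores are normalized with a softmax. The temperature `tau`, the function name `underappreciation_weights`, and the sample values are all assumptions made for illustration.

```python
import math

def underappreciation_weights(log_probs, rewards, tau=0.1):
    """Hypothetical helper: softmax over (reward / tau - log_prob).

    A sequence scores high when its reward is large relative to the
    log-probability the current policy assigns to it, i.e. when the
    policy "under-appreciates" it. tau is an assumed temperature.
    """
    scores = [r / tau - lp for lp, r in zip(log_probs, rewards)]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Three sampled action sequences (made-up numbers):
# index 0: likely under the policy, no reward
# index 1: unlikely under the policy, rewarded (under-appreciated)
# index 2: likely under the policy, rewarded
log_probs = [-1.0, -5.0, -1.0]
rewards = [0.0, 1.0, 1.0]
weights = underappreciation_weights(log_probs, rewards, tau=1.0)
```

In a REINFORCE-style update, weights like these could scale each sample's log-probability gradient, so the rewarded but improbable sequence (index 1) receives the largest push. How the weights are combined with the standard policy-gradient term is not stated in the abstract and is left open here.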
Related papers
- Policy Gradient From Demonstration And Curiosity (2020)
- Model-free Policy Learning With Reward Gradients (2021)
- Behind The Myth Of Exploration In Policy Gradients (2024)
- Intrinsic Reward Policy Optimization For Sparse-reward Environments (2026)
- The Reinforce Policy Gradient Algorithm Revisited (2023)
- Minimax-optimal Reward-agnostic Exploration In Reinforcement Learning (2023)
- PC-PG: Policy Cover Directed Exploration For Provable Policy Gradient Learning (2020)
- S-REINFORCE: A Neuro-symbolic Policy Gradient Approach For Interpretable Reinforcement Learning (2023)