Behind The Myth Of Exploration In Policy Gradients
2024 · Adrien Bolland, Gaspard Lambrechts, Damien Ernst
Abstract
In order to compute near-optimal policies with policy-gradient algorithms, it is common in practice to include intrinsic exploration terms in the learning objective. Although the effectiveness of these terms is usually justified by an intrinsic need to explore environments, we propose a novel analysis with the lens of numerical optimization. Two criteria are introduced on the learning objective and two others on its stochastic gradient estimates, and are afterwards used to discuss the quality of the policy after optimization. The analysis sheds light on two separate effects of exploration techniques. First, they make it possible to smooth the learning objective and to eliminate local optima while preserving the global maximum. Second, they modify the gradient estimates, increasing the probability that the stochastic parameter updates eventually provide an optimal policy. We empirically illustrate these effects with exploration strategies based on entropy bonuses, identifying limitation
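As a rough illustration of the entropy-bonus idea mentioned in the abstract, the sketch below evaluates an entropy-regularized objective for a softmax policy in a single-state (bandit) problem. This is a simplification chosen for brevity, not the setting or code used in the paper, and the names `softmax`, `entropy`, `regularized_objective`, and `bonus_weight` are hypothetical.

```python
import math


def softmax(logits):
    """Turn policy parameters (one logit per action) into action probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy of a discrete distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def regularized_objective(logits, rewards, bonus_weight):
    """Expected reward plus an entropy bonus (assumed form: J + weight * H)."""
    probs = softmax(logits)
    expected_reward = sum(p * r for p, r in zip(probs, rewards))
    return expected_reward + bonus_weight * entropy(probs)


if __name__ == "__main__":
    rewards = [1.0, 0.0]  # hypothetical rewards for two actions
    for weight in (0.0, 0.1):
        value = regularized_objective([0.0, 0.0], rewards, weight)
        print(f"bonus_weight={weight}: objective={value:.4f}")
```

With `bonus_weight` set to zero the function reduces to the plain expected reward. A positive weight adds a term that is largest for the uniform policy and vanishes for deterministic ones, which is one simple way to see how such a bonus reshapes the objective being optimized.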
Related papers
- Improving Policy Gradient By Exploring Under-appreciated Rewards (2016)
- Curious Explorer: A Provable Exploration Strategy In Policy Learning (2021)
- Learning To Explore With Meta-policy Gradient (2018)
- Policy Gradient Algorithms Implicitly Optimize By Continuation (2023)
- Exploration Conscious Reinforcement Learning Revisited (2018)
- Intrinsic Reward Policy Optimization For Sparse-reward Environments (2026)
- Policy Gradient From Demonstration And Curiosity (2020)
- PC-PG: Policy Cover Directed Exploration For Provable Policy Gradient Learning (2020)