Intrinsic Reward Policy Optimization For Sparse-reward Environments
2026 · Minjae Cho, Huy Trong Tran
Abstract
Exploration is essential in reinforcement learning because an agent relies on trial and error to learn an optimal policy. However, when rewards are sparse, naive exploration strategies such as noise injection are often insufficient. Intrinsic rewards can provide principled guidance for exploration, for example when they are combined with extrinsic rewards to optimize a policy or used to train subpolicies for hierarchical learning. However, the former approach suffers from unstable credit assignment, while the latter exhibits sample inefficiency and sub-optimality. We propose a policy optimization framework that leverages multiple intrinsic rewards to directly optimize a policy for an extrinsic reward without pretraining subpolicies. Our algorithm, intrinsic reward policy optimization (IRPO), achieves this by using a surrogate policy gradient that provides a more informative learning signal than the true gradient in sparse-reward environments. We demonstrate that IRPO improves perf
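The abstract contrasts IRPO with the common practice of adding an intrinsic bonus to the extrinsic reward. The sketch below illustrates only that baseline, not IRPO's surrogate policy gradient, which the abstract does not specify. It assumes a count-based novelty bonus 1/sqrt(N(s)) over discrete states and a hypothetical mixing weight `beta`; the function names are illustrative, not from the paper. With a sparse extrinsic reward, the discounted returns that would weight a REINFORCE-style gradient are all zero, while the mixed reward yields nonzero returns.

```python
import math
from collections import Counter


def count_bonus(states, counts):
    """Count-based novelty bonus r_int(s) = 1 / sqrt(N(s)) (an assumed choice).

    Increments the visit count N(s) for each state before computing its bonus,
    so repeated visits earn progressively smaller bonuses.
    """
    bonuses = []
    for s in states:
        counts[s] += 1
        bonuses.append(1.0 / math.sqrt(counts[s]))
    return bonuses


def mixed_rewards(extrinsic, intrinsic, beta=0.1):
    """Baseline reward mixing r_t = r_ext_t + beta * r_int_t (beta is hypothetical)."""
    return [e + beta * i for e, i in zip(extrinsic, intrinsic)]


def discounted_returns(rewards, gamma=0.99):
    """Reward-to-go G_t = sum_k gamma^k * r_{t+k}, computed backwards in time."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns


if __name__ == "__main__":
    counts = Counter()
    states = [0, 1, 1, 2]             # a short trajectory of discrete states
    extrinsic = [0.0, 0.0, 0.0, 0.0]  # sparse reward: the goal was never reached
    intrinsic = count_bonus(states, counts)
    # Extrinsic-only returns are all zero, so they give no learning signal.
    print(discounted_returns(extrinsic))
    # Mixed returns are positive because every novelty bonus is positive.
    print(discounted_returns(mixed_rewards(extrinsic, intrinsic)))
```

Because novelty and task reward are folded into one return, the policy gradient cannot tell which part of the signal came from which source; this is one plausible reading of the unstable credit assignment the abstract attributes to this approach.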
Related papers
- Redeeming Intrinsic Rewards Via Constrained Optimization (2022)
- On Learning Intrinsic Rewards For Policy Gradient Methods (2018)
- Improving Policy Gradient By Exploring Under-appreciated Rewards (2016)
- Coordinated Exploration Via Intrinsic Rewards For Multi-agent Reinforcement Learning (2019)
- Information Content Exploration (2023)
- Continuously Discovering Novel Strategies Via Reward-switching Policy Optimization (2022)
- The Impact Of Intrinsic Rewards On Exploration In Reinforcement Learning (2025)
- Think Outside The Policy: In-context Steered Policy Optimization (2025)