Behaviour Policy Optimization: Provably Lower Variance Return Estimates For Off-policy Reinforcement Learning
2025 · Alexander W. Goodall, Edwin Hamel-de Le Court, Francesco Belardinelli
Abstract
Many reinforcement learning algorithms, particularly those that rely on return estimates for policy improvement, can suffer from poor sample efficiency and training instability due to high-variance return estimates. In this paper we leverage new results from off-policy evaluation: it has recently been shown that well-designed behaviour policies can be used to collect off-policy data for provably lower-variance return estimates. This result is surprising, as it means collecting data on-policy is not variance-optimal. We extend this key insight to the online reinforcement learning setting, where policy evaluation and improvement are interleaved to learn optimal policies. Off-policy RL has been well studied (e.g., IMPALA), with correct and truncated importance-weighted samples for de-biasing and managing variance appropriately. Generally these approaches are concerned with reconciling data collected from multiple workers in parallel, while the policy is updated asynchronously, mismatc
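The abstract's claim that on-policy data collection is not variance-optimal can be illustrated with a textbook one-step bandit. The sketch below is not the paper's algorithm: the two-action problem, the reward values, and all function names are hypothetical, and the behaviour policy is built from the true rewards, which would have to be learned in practice. It relies only on the standard importance-sampling fact that, for deterministic positive rewards, sampling actions in proportion to `pi(a) * |r(a)|` makes every importance-weighted sample equal to the true value.

```python
import random
import statistics

def on_policy_samples(pi, rewards, n, rng):
    """Monte Carlo return samples collected by acting with the target policy pi."""
    actions = rng.choices(range(len(pi)), weights=pi, k=n)
    return [rewards[a] for a in actions]

def off_policy_samples(pi, mu, rewards, n, rng):
    """Importance-weighted return samples collected by acting with behaviour policy mu."""
    actions = rng.choices(range(len(mu)), weights=mu, k=n)
    return [pi[a] / mu[a] * rewards[a] for a in actions]

# Hypothetical two-action bandit with deterministic rewards (an assumption for illustration).
pi = [0.9, 0.1]          # target policy
rewards = [1.0, 10.0]    # reward of each action
true_value = sum(p * r for p, r in zip(pi, rewards))

# Behaviour policy proportional to pi(a) * |r(a)|. It uses the true rewards,
# so it is an idealised construction rather than something available up front.
unnormalised = [p * abs(r) for p, r in zip(pi, rewards)]
mu = [u / sum(unnormalised) for u in unnormalised]

rng = random.Random(0)
n = 10_000
on = on_policy_samples(pi, rewards, n, rng)
off = off_policy_samples(pi, mu, rewards, n, rng)

on_mean, on_var = statistics.mean(on), statistics.pvariance(on)
off_mean, off_var = statistics.mean(off), statistics.pvariance(off)

print(f"true value       : {true_value:.4f}")
print(f"on-policy  mean/var: {on_mean:.4f} / {on_var:.4f}")
print(f"off-policy mean/var: {off_mean:.4f} / {off_var:.3e}")
```

Both estimators are unbiased for the value of `pi`, but every off-policy sample here equals the true value, so its variance is zero up to floating-point error, while the on-policy samples keep the full reward variance.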
Related papers
- Efficient Policy Evaluation With Safety Constraint For Reinforcement Learning (2024)
- Near-optimal Offline Reinforcement Learning Via Double Variance Reduction (2021)
- Doubly Optimal Policy Evaluation For Reinforcement Learning (2024)
- Low Variance Off-policy Evaluation With State-based Importance Sampling (2022)
- Robust On-policy Sampling For Data-efficient Policy Evaluation In Reinforcement Learning (2021)
- Statistically Efficient Variance Reduction With Double Policy Estimation For Off-policy Evaluation In Sequence-modeled Reinforcement Learning (2023)
- Sample Dropout: A Simple Yet Effective Variance Reduction Technique In Deep Policy Optimization (2023)
- Conservative Exploration For Policy Optimization Via Off-policy Policy Evaluation (2023)