On-policy Policy Gradient Reinforcement Learning Without On-policy Sampling
2023 · Nicholas E. Corrado, Josiah P. Hanna
Abstract
On-policy reinforcement learning (RL) algorithms are typically characterized as algorithms that perform policy updates using i.i.d. trajectories collected by the agent's current policy. However, after observing only a finite number of trajectories, such on-policy sampling may produce data that fails to match the expected on-policy data distribution. This sampling error leads to high-variance gradient estimates that make on-policy learning data-inefficient. Recent work in the policy evaluation setting has shown that non-i.i.d., off-policy sampling can produce data with lower sampling error with respect to the expected on-policy distribution than on-policy sampling can (Zhong et al., 2022). Motivated by this observation, we introduce an adaptive, off-policy sampling method to reduce sampling error during on-policy policy gradient RL training. Our method, Proximal Robust On-Policy Sampling (PROPS), reduces sampling error by collecting data with a behavior policy that increases the probabi
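The notion of sampling error described above can be illustrated with a toy example. The sketch below is not PROPS and is not taken from the paper; it assumes a single-state problem with three actions, measures sampling error as the total variation distance between the empirical action frequencies and the policy, and compares i.i.d. sampling against a simple hypothetical non-i.i.d. rule that always picks the currently most under-sampled action. All function names and the example policy are assumptions made for illustration.

```python
import numpy as np


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * np.abs(p - q).sum()


def iid_sampling_error(policy, n, rng):
    """Sampling error after drawing n actions i.i.d. from the policy."""
    actions = rng.choice(len(policy), size=n, p=policy)
    counts = np.bincount(actions, minlength=len(policy))
    return total_variation(counts / n, policy)


def adaptive_sampling_error(policy, n):
    """Sampling error of a hypothetical non-i.i.d. sampler (not PROPS).

    At each step it picks the action whose observed count lags furthest
    behind its expected count under the policy.
    """
    counts = np.zeros(len(policy))
    for t in range(n):
        # Expected count after t + 1 samples minus the count observed so far.
        deficit = (t + 1) * policy - counts
        counts[np.argmax(deficit)] += 1
    return total_variation(counts / n, policy)


policy = np.array([0.5, 0.3, 0.2])  # assumed example policy
rng = np.random.default_rng(0)
n = 20

iid_errors = [iid_sampling_error(policy, n, rng) for _ in range(1000)]
mean_iid_error = float(np.mean(iid_errors))
adaptive_error = float(adaptive_sampling_error(policy, n))

print(f"mean i.i.d. sampling error: {mean_iid_error:.3f}")
print(f"adaptive sampling error:    {adaptive_error:.3f}")
```

With a small sample budget, the i.i.d. sampler's empirical frequencies usually deviate noticeably from the policy, while the deficit-driven sampler tracks it closely. This only illustrates why non-i.i.d. data collection can reduce sampling error; it says nothing about how PROPS itself constructs its behavior policy.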
Related papers
- Robust On-policy Sampling For Data-efficient Policy Evaluation In Reinforcement Learning (2021)
- Batch Reinforcement Learning With A Nonparametric Off-policy Policy Gradient (2020)
- Q-prop: Sample-efficient Policy Gradient With An Off-policy Critic (2016)
- Off-policy Policy Gradient Algorithms By Constraining The State Distribution Shift (2019)
- Behaviour Policy Optimization: Provably Lower Variance Return Estimates For Off-policy Reinforcement Learning (2025)
- Model-free Policy Learning With Reward Gradients (2021)
- Proximal Policy Optimization Algorithms (2017)
- Doubly Robust Off-policy Value And Gradient Estimation For Deterministic Policies (2020)