Off-OAB: Off-policy Policy Gradient Method With Optimal Action-dependent Baseline
2024 · Wenjia Meng, Qian Zheng, Long Yang, et al.
Abstract
Policy-based methods have achieved remarkable success in solving challenging reinforcement learning problems. Among these methods, off-policy policy gradient methods are particularly important because they can benefit from off-policy data. However, these methods suffer from the high variance of the off-policy policy gradient (OPPG) estimator, which results in poor sample efficiency during training. In this paper, we propose an off-policy policy gradient method with the optimal action-dependent baseline (Off-OAB) to mitigate this variance issue. Specifically, this baseline keeps the OPPG estimator unbiased while theoretically minimizing its variance. To improve computational efficiency in practice, we design an approximated version of this optimal baseline. Using this approximation, our method (Off-OAB) aims to decrease the OPPG estimator's variance during policy optimization. We evaluate the proposed Off-OAB method on six representative tasks from OpenAI Gym and MuJoCo,
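The claim that an action-dependent baseline can leave an off-policy gradient estimator unbiased while reducing its variance can be checked on a toy problem. The sketch below is not the Off-OAB baseline, whose form the abstract does not give. It assumes a one-state, three-action bandit, a softmax target policy, a uniform behavior policy, and a hand-picked baseline `b(a)`, and it uses the generic construction in which the expected baseline term under the target policy is added back as a correction. All function and variable names are hypothetical. Means and variances are computed exactly by enumerating actions, so no sampling is involved.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def grad_log_pi(pi, a):
    """Gradient of log pi(a) w.r.t. the softmax logits: one_hot(a) - pi."""
    return [(1.0 if i == a else 0.0) - p for i, p in enumerate(pi)]


def estimator_stats(pi, mu, rewards, baseline):
    """Exact mean and total variance (trace of the covariance) under mu of
    rho(a) * grad_log_pi(a) * (r(a) - b(a)) + E_{a~pi}[grad_log_pi(a) * b(a)].
    """
    k = len(pi)
    # Correction term: expectation of grad_log_pi(a) * b(a) under the target
    # policy. It is what keeps an action-dependent baseline unbiased.
    correction = [0.0] * k
    for a in range(k):
        g = grad_log_pi(pi, a)
        for i in range(k):
            correction[i] += pi[a] * g[i] * baseline[a]

    # One estimator value per action the behavior policy could have drawn.
    samples = []
    for a in range(k):
        rho = pi[a] / mu[a]  # importance weight
        g = grad_log_pi(pi, a)
        samples.append(
            [rho * g[i] * (rewards[a] - baseline[a]) + correction[i] for i in range(k)]
        )

    mean = [sum(mu[a] * samples[a][i] for a in range(k)) for i in range(k)]
    var = sum(
        mu[a] * sum((samples[a][i] - mean[i]) ** 2 for i in range(k))
        for a in range(k)
    )
    return mean, var


# Assumed toy setup (illustrative values only).
pi = softmax([0.5, 0.0, -0.5])   # target policy
mu = [1.0 / 3.0] * 3             # behavior policy that generated the data
rewards = [1.0, 2.0, 4.0]

mean_none, var_none = estimator_stats(pi, mu, rewards, [0.0, 0.0, 0.0])
mean_ad, var_ad = estimator_stats(pi, mu, rewards, [0.9, 2.1, 3.8])

print("mean without baseline:", mean_none)
print("mean with action-dependent baseline:", mean_ad)
print("total variance without / with baseline:", var_none, var_ad)
```

In this setup both estimators have the same mean, the true policy gradient, while the variance shrinks as `b(a)` moves closer to `r(a)`. Choosing the baseline that minimizes this variance, and approximating it cheaply, is the problem the abstract describes.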
Related papers
- Variance Reduction For Policy Gradient With Action-dependent Factorized Baselines (2018)
- Batch Reinforcement Learning With A Nonparametric Off-policy Policy Gradient (2020)
- Off-policy Policy Gradient With State Distribution Correction (2019)
- Action-dependent Control Variates For Policy Optimization Via Stein's Identity (2017)
- Behaviour Policy Optimization: Provably Lower Variance Return Estimates For Off-policy Reinforcement Learning (2025)
- Proximal Policy Optimization Algorithms (2017)
- Semi-on-policy Training For Sample Efficient Multi-agent Policy Gradients (2021)
- Policy Gradient Methods For Reinforcement Learning With Function Approximation And Action-dependent Baselines (2017)