Q-Prop: Sample-efficient Policy Gradient With An Off-policy Critic
2016 · Shixiang Gu, Timothy Lillicrap, Zoubin Ghahramani, et al.
Abstract
Model-free deep reinforcement learning (RL) methods have been successful in a wide variety of simulated domains. However, a major obstacle facing deep RL in the real world is its high sample complexity. Batch policy gradient methods offer stable learning, but at the cost of high variance, which often requires large batches. TD-style methods, such as off-policy actor-critic and Q-learning, are more sample-efficient but biased, and often require costly hyperparameter sweeps to stabilize. In this work, we aim to develop methods that combine the stability of policy gradients with the efficiency of off-policy RL. We present Q-Prop, a policy gradient method that uses a Taylor expansion of the off-policy critic as a control variate. Q-Prop is both sample-efficient and stable, and effectively combines the benefits of on-policy and off-policy methods. We analyze the connection between Q-Prop and existing model-free algorithms, and use control variate theory to derive two variants of Q-Prop wi
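The control-variate idea named in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: it assumes a single-state problem with a one-dimensional Gaussian policy, and the functions `true_return`, `critic`, `critic_grad`, and `gradient_estimates` are hypothetical names chosen for illustration. The critic is deliberately imperfect, to show that subtracting its first-order Taylor expansion around the policy mean and adding back the analytic expectation of that term leaves the gradient estimate unbiased while reducing its variance.

```python
import numpy as np

def true_return(a):
    # Hypothetical stand-in for the Monte Carlo return of action a.
    return -(a - 2.0) ** 2

def critic(a):
    # Hypothetical, deliberately inaccurate critic Q_w(a).
    return -(a - 1.5) ** 2

def critic_grad(a):
    # Derivative of the critic with respect to the action.
    return -2.0 * (a - 1.5)

def gradient_estimates(mu, sigma, n_samples, rng):
    """Per-sample estimates of d/dmu E[true_return(a)] for a ~ N(mu, sigma^2)."""
    a = rng.normal(mu, sigma, size=n_samples)
    score = (a - mu) / sigma**2  # d/dmu of log N(a; mu, sigma^2)

    # Plain likelihood-ratio (REINFORCE-style) estimator.
    plain = score * true_return(a)

    # First-order Taylor expansion of the critic around the policy mean.
    q_bar = critic(mu) + critic_grad(mu) * (a - mu)

    # Subtract the control variate, then add back its exact expectation:
    # E[score * q_bar] = critic_grad(mu) for a Gaussian policy.
    with_cv = score * (true_return(a) - q_bar) + critic_grad(mu)
    return plain, with_cv

rng = np.random.default_rng(0)
mu, sigma = 0.0, 0.5
plain, with_cv = gradient_estimates(mu, sigma, n_samples=200_000, rng=rng)

# Closed form for this toy: E[-(a - 2)^2] = -((mu - 2)^2 + sigma^2).
exact = -2.0 * (mu - 2.0)

print("exact gradient:       ", exact)
print("plain mean / variance:", plain.mean(), plain.var())
print("with control variate: ", with_cv.mean(), with_cv.var())
```

Both estimators target the same gradient, but in this toy the control-variate version has noticeably lower per-sample variance even though the critic is wrong about where the optimum lies. How much variance is removed depends on how well the linearized critic tracks the return.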
Related papers
- Mitigating Off-policy Bias In Actor-critic Methods With One-step Q-learning: A Novel Correction Approach (2022) 0.00
- On-policy Policy Gradient Reinforcement Learning Without On-policy Sampling (2023) 0.00
- Combining Policy Gradient And Q-learning (2016) 0.00
- Mitigating Suboptimality Of Deterministic Policy Gradients In Complex Q-functions (2024) 0.00
- An Approximate Policy Iteration Viewpoint Of Actor-critic Algorithms (2022) 2.26
- Quantile-based Deep Reinforcement Learning Using Two-timescale Policy Gradient Algorithms (2023) 0.00
- Sample-efficient Model-free Reinforcement Learning With Off-policy Critics (2019) 9.60
- Proximal Policy Optimization Algorithms (2017) 0.00