Residual Policy Gradient: A Reward View of KL-regularized Objective
2025 · Pengcheng Wang, Xinghao Zhu, Yuxin Chen, et al.
Abstract
Reinforcement Learning and Imitation Learning have achieved widespread success in many domains but remain constrained during real-world deployment. One of the main issues is that deployment often imposes additional requirements that were not considered during training. To address this challenge, policy customization has been introduced, which aims to adapt a prior policy so that it meets new task-specific requirements while preserving its inherent properties. A principled approach to policy customization is Residual Q-Learning (RQL), which formulates the problem as a Markov Decision Process (MDP) and derives a family of value-based learning algorithms. However, RQL has not yet been applied to policy gradient methods, which restricts its applicability, especially in tasks where policy gradient has already proven more effective. In this work, we first derive a concise form of Soft Policy Gradient as a preliminary. Building on this, we introduce Residual Policy Gradient (RPG), which extends RQL to policy gradient methods,
Related papers
- Rethinking KL Regularization In RLHF: From Value Estimation To Gradient Optimization (2025)
- Why Policy Gradient Algorithms Work For Undiscounted Total-reward MDPs (2025)
- Improving Policy Gradient By Exploring Under-appreciated Rewards (2016)
- PC-PG: Policy Cover Directed Exploration For Provable Policy Gradient Learning (2020)
- Learning Optimal Deterministic Policies With Stochastic Policy Gradients (2024)
- Some Remarks On Gradient Dominance And LQR Policy Optimization (2025)
- Model-free Policy Learning With Reward Gradients (2021)
- Advantage Shaping As Surrogate Reward Maximization: Unifying Pass@k Policy Gradients (2025)