Rethinking KL Regularization In RLHF: From Value Estimation To Gradient Optimization
2025 · Kezhao Liu, Jason Klein Liu, Mingtao Chen, et al.
Abstract
Reinforcement Learning from Human Feedback (RLHF) leverages a Kullback-Leibler (KL) divergence loss to stabilize training and prevent overfitting. However, in methods such as GRPO, its implementation may be guided by principles from numerical value estimation, a practice that overlooks the term's functional role as an optimization loss. To analyze this issue, we establish a unified framework that connects two seemingly distinct implementation styles: using the mathematical term \(k_n\) as a detached coefficient for the policy's score function ('\(k_n\) in reward') or as a direct loss function through which gradients are propagated ('\(k_n\) as loss'). We show that the latter can always be analyzed via an equivalent gradient coefficient in the former, unifying the two perspectives. Through this framework, we prove that the conventional '\(k_1\) in reward' (as in PPO) is the principled loss for Reverse KL (RKL) regularization. We further establish a key finding: under on-policy conditio
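The two implementation styles named in the abstract can be compared per sample with a small sketch. The abstract does not define \(k_n\); the code below assumes the commonly used sample-based estimators \(k_1 = \delta\), \(k_2 = \tfrac{1}{2}\delta^2\), \(k_3 = e^{-\delta} - 1 + \delta\), with \(\delta = \log \pi_\theta(y|x) - \log \pi_{\mathrm{ref}}(y|x)\) for a sampled \(y\). The function names are hypothetical and chosen only for illustration. Because the reference policy is fixed, \(\nabla_\theta \delta = \nabla_\theta \log \pi_\theta\), so each style reduces to a scalar coefficient multiplying the score function.

```python
import math

# Assumed definitions (not taken from the abstract): common k1/k2/k3
# sample-based KL estimators, written in terms of
# delta = log pi_theta(y|x) - log pi_ref(y|x) for a sampled y.
def k1(delta):
    return delta

def k2(delta):
    return 0.5 * delta ** 2

def k3(delta):
    return math.exp(-delta) - 1.0 + delta

def coeff_in_reward(k, delta):
    """'k_n in reward': k_n is detached, so it multiplies grad log pi directly."""
    return k(delta)

def coeff_as_loss(k, delta, eps=1e-6):
    """'k_n as loss': gradients flow through k_n, so by the chain rule the
    per-sample gradient is (d k_n / d delta) * grad log pi.
    The derivative is approximated here with a central finite difference."""
    return (k(delta + eps) - k(delta - eps)) / (2.0 * eps)

if __name__ == "__main__":
    delta = 0.3  # hypothetical log-ratio for one sampled response
    for name, k in (("k1", k1), ("k2", k2), ("k3", k3)):
        print(name,
              "in reward:", round(coeff_in_reward(k, delta), 6),
              "| as loss:", round(coeff_as_loss(k, delta), 6))
```

Under these assumed definitions, the coefficient for \(k_2\) used as a loss coincides with the value of \(k_1\) used in the reward, while \(k_3\) used as a loss yields \(1 - e^{-\delta}\). This is a per-sample property of the formulas and says nothing on its own about expectations or off-policy sampling.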
Related papers
- Leverage The Average: An Analysis Of KL Regularization In RL (2020)
- Residual Policy Gradient: A Reward View Of KL-regularized Objective (2025)
- A Unified Framework For Rethinking Policy Divergence Measures In GRPO (2026)
- Principled Reinforcement Learning With Human Feedback From Pairwise Or \(k\)-wise Comparisons (2023)
- Sharp Analysis For KL-regularized Contextual Bandits And RLHF (2024)
- Reinforcement Learning With Verifiable Rewards: GRPO's Effective Loss, Dynamics, And Success Amplification (2025)
- Exploration-driven Policy Optimization In RLHF: Theoretical Insights On Efficient Data Utilization (2024)
- Zeroth-order Policy Gradient For Reinforcement Learning From Human Feedback Without Reward Inference (2024)