Rethinking KL Regularization In RLHF: From Value Estimation To Gradient Optimization

Abstract

Reinforcement Learning from Human Feedback (RLHF) leverages a Kullback-Leibler (KL) divergence loss to stabilize training and prevent overfitting. However, in methods such as GRPO, its implementation may be guided by principles from numerical value estimation, a practice that overlooks the term's functional role as an optimization loss. To analyze this issue, we establish a unified framework that connects two seemingly distinct implementation styles: using the mathematical term \(k_n\) as a detached coefficient for the policy's score function ('\(k_n\) in reward') or as a direct loss function through which gradients are propagated ('\(k_n\) as loss'). We show that the latter can always be analyzed via an equivalent gradient coefficient in the former, unifying the two perspectives. Through this framework, we prove that the conventional '\(k_1\) in reward' (as in PPO) is the principled loss for Reverse KL (RKL) regularization. We further establish a key finding: under on-policy conditio
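The claim that '\(k_1\) in reward' yields the Reverse KL gradient can be checked numerically on a toy problem. The sketch below is illustrative only and is not the authors' code. It assumes that \(k_1(y) = \log \pi_\theta(y) - \log \pi_{\text{ref}}(y)\) (the abstract does not define \(k_n\)) and that the policy is a four-outcome categorical distribution parameterized by softmax logits. All function names and numeric values are hypothetical. It compares the exact expectation of the detached coefficient times the score function against a finite-difference gradient of \(\mathrm{KL}(\pi_\theta \,\|\, \pi_{\text{ref}})\).

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reverse_kl(logits, ref_probs):
    # KL(pi_theta || pi_ref) for a categorical policy with softmax logits.
    probs = softmax(logits)
    return sum(p * math.log(p / q) for p, q in zip(probs, ref_probs))

def k1_in_reward_gradient(logits, ref_probs):
    # Exact expectation over y ~ pi_theta of k1(y) * grad_logits log pi_theta(y),
    # where k1 is treated as a constant (detached) coefficient.
    # Assumption: k1(y) = log pi_theta(y) - log pi_ref(y).
    probs = softmax(logits)
    k1 = [math.log(p / q) for p, q in zip(probs, ref_probs)]
    n = len(logits)
    grad = [0.0] * n
    for y in range(n):
        for j in range(n):
            # Score function of a softmax policy: d log pi(y) / d logit_j.
            score = (1.0 if j == y else 0.0) - probs[j]
            grad[j] += probs[y] * k1[y] * score
    return grad

def finite_difference_gradient(logits, ref_probs, eps=1e-6):
    # Central-difference gradient of the Reverse KL with respect to the logits.
    grad = []
    for j in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[j] += eps
        down[j] -= eps
        diff = reverse_kl(up, ref_probs) - reverse_kl(down, ref_probs)
        grad.append(diff / (2 * eps))
    return grad

# Hypothetical toy values chosen only for illustration.
logits = [0.3, -1.2, 0.8, 0.1]
ref_probs = [0.4, 0.3, 0.2, 0.1]

score_grad = k1_in_reward_gradient(logits, ref_probs)
exact_grad = finite_difference_gradient(logits, ref_probs)
max_gap = max(abs(a - b) for a, b in zip(score_grad, exact_grad))
print("largest gap between the two gradients:", max_gap)
```

In this toy setting the two gradients agree up to finite-difference error. In practice the expectation would be estimated from sampled responses rather than enumerated, so this check says nothing about estimator variance.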
