A Unified Framework For Rethinking Policy Divergence Measures In GRPO
2026 · Qingyuan Wu, Yuhui Wang, Simon Sinong Zhan, et al.
Abstract
Reinforcement Learning with Verified Reward (RLVR) has emerged as a critical paradigm for advancing the reasoning capabilities of Large Language Models (LLMs). Most existing RLVR methods, such as GRPO and its variants, ensure stable updates by constraining policy divergence through clipping likelihood ratios. This paper introduces a unified clipping framework that characterizes existing methods via a general notion of policy divergence, encompassing both likelihood ratios and Kullback-Leibler (KL) divergences and extending to alternative measures. The framework provides a principled foundation for systematically analyzing how different policy divergence measures affect exploration and performance. We further identify the KL3 estimator, a variance-reduced Monte Carlo estimator of the KL divergence, as a key policy divergence constraint. We theoretically demonstrate that the KL3-based constraint is mathematically equivalent to an asymmetric ratio-based clipping that reallocates probabili
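The link the abstract draws between a KL3 constraint and asymmetric ratio clipping can be illustrated with a small numerical sketch. It assumes the common per-sample form of the estimator, k3(r) = (r - 1) - log r, with r taken as the likelihood ratio of the new policy to the old one; the direction of the ratio, the threshold name `delta`, and the helper names below are illustrative assumptions, not definitions taken from the paper.

```python
import math


def kl3(ratio):
    """Per-sample k3 estimate (r - 1) - log r; non-negative, zero at r = 1.

    Assumption: r is the new-to-old likelihood ratio of a sampled token.
    """
    return (ratio - 1.0) - math.log(ratio)


def kl3_ratio_bounds(delta, tol=1e-12):
    """Return (low, high) such that kl3(r) <= delta exactly when low <= r <= high.

    kl3 is convex with its minimum at r = 1, so each side of 1 holds one root
    of kl3(r) = delta; both are located by bisection.
    """
    if delta <= 0.0:
        raise ValueError("delta must be positive")

    def bisect(a, b):
        # Precondition: kl3(a) - delta and kl3(b) - delta have opposite signs.
        fa = kl3(a) - delta
        for _ in range(200):
            mid = 0.5 * (a + b)
            fm = kl3(mid) - delta
            if (fm > 0.0) == (fa > 0.0):
                a, fa = mid, fm
            else:
                b = mid
            if b - a < tol:
                break
        return 0.5 * (a + b)

    # Left bracket: at r = exp(-(1 + delta)), kl3(r) = r + delta > delta.
    low = bisect(math.exp(-(1.0 + delta)), 1.0)

    # Right bracket: double the upper end until the constraint is violated.
    upper = 2.0
    while kl3(upper) <= delta:
        upper *= 2.0
    high = bisect(1.0, upper)
    return low, high


if __name__ == "__main__":
    low, high = kl3_ratio_bounds(0.02)  # 0.02 is an arbitrary illustrative threshold
    print("ratio interval:", low, high)
    print("room below 1:", 1.0 - low, "room above 1:", high - 1.0)
```

Under these assumptions the admissible interval always extends further above 1 than below it, because k3(1 + x) < k3(1 - x) for any 0 < x < 1. A single KL3 threshold therefore behaves like a ratio clip with a looser upper bound than lower bound.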
Related papers
- Reinforcement Learning With Verifiable Rewards: GRPO's Effective Loss, Dynamics, And Success Amplification (2025)
- Rethinking KL Regularization In RLHF: From Value Estimation To Gradient Optimization (2025)
- Beyond KL Divergence: Policy Optimization With Flexible Bregman Divergences For LLM Reasoning (2026)
- Stabilizing Reinforcement Learning For Diffusion Language Models (2026)
- EP-GRPO: Entropy-progress Aligned Group Relative Policy Optimization With Implicit Process Guidance (2026)
- Shrinking The Variance: Shrinkage Baselines For Reinforcement Learning With Verifiable Rewards (2025)
- Think Outside The Policy: In-context Steered Policy Optimization (2025)
- Uniform-correct Policy Optimization: Breaking RLVR's Indifference To Diversity (2026)