Group-relative REINFORCE Is Secretly An Off-policy Algorithm: Demystifying Some Myths About GRPO And Its Friends
2025 · Chaorui Yao, Yanxi Chen, Yuchang Sun, et al.
Abstract
Off-policy reinforcement learning (RL) for large language models (LLMs) is attracting growing interest, driven by practical constraints in real-world applications, the complexity of LLM-RL infrastructure, and the need for further innovations of RL methodologies. While classic REINFORCE and its modern variants like Group Relative Policy Optimization (GRPO) are typically regarded as on-policy algorithms with limited tolerance of off-policyness, we present in this work a first-principles derivation for group-relative REINFORCE -- a REINFORCE variant that uses the within-group mean reward as the baseline for advantage calculation -- without assuming a specific training data distribution, showing that it admits a native off-policy interpretation. This perspective yields two general principles for adapting REINFORCE to truly off-policy settings: regularizing policy updates, and actively shaping the data distribution. Our analysis demystifies some myths about the roles of importance sampling
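The baseline described in the abstract (the within-group mean reward) can be illustrated with a minimal sketch. The function names, reward values, and log-probabilities below are hypothetical and chosen only for illustration; this is not the authors' implementation, and it assumes one scalar reward and one sequence-level log-probability per sampled response.

```python
from statistics import fmean


def group_relative_advantages(rewards):
    """Subtract the within-group mean reward from each response's reward."""
    baseline = fmean(rewards)
    return [r - baseline for r in rewards]


def reinforce_surrogate_loss(logprobs, advantages):
    """Plain REINFORCE surrogate: negative mean of advantage * log-probability.

    In a real training loop the log-probabilities would be differentiable
    tensors; here they are floats, so this only shows the arithmetic.
    """
    return -fmean(a * lp for a, lp in zip(advantages, logprobs))


# Hypothetical rewards for four responses sampled for the same prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Hypothetical sequence-level log-probabilities under the current policy.
logprobs = [-2.0, -1.0, -3.0, -2.0]
loss = reinforce_surrogate_loss(logprobs, advantages)
```

Because the baseline is the group mean, the advantages within a group always sum to zero: responses that beat the group average are pushed up and the rest are pushed down.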
Related papers
- Noise-corrected GRPO: From Noisy Rewards To Unbiased Gradients (2025)
- NGRPO: Negative-enhanced Group Relative Policy Optimization (2025)
- EP-GRPO: Entropy-progress Aligned Group Relative Policy Optimization With Implicit Process Guidance (2026)
- TIC-GRPO: Provable And Efficient Optimization For Reinforcement Learning From Human Feedback (2025)
- Stepwise Guided Policy Optimization: Coloring Your Incorrect Reasoning In GRPO (2025)
- Arbitrary Entropy Policy Optimization Breaks The Exploration Bottleneck Of Reinforcement Learning (2025)
- Reinforcement Learning With Verifiable Rewards: GRPO's Effective Loss, Dynamics, And Success Amplification (2025)
- Think Outside The Policy: In-context Steered Policy Optimization (2025)