Balanced Aggregation: Understanding And Fixing Aggregation Bias In GRPO
2026 · Zhiyuan Zeng, Jiameng Huang, Zhangyue Yin, et al.
Abstract
arXiv:2605.04077v1
Reinforcement learning with verifiable rewards (RLVR) has become a central paradigm for improving reasoning and code generation in large language models, and GRPO-style training is widely adopted for its simplicity and effectiveness. However, an important design choice remains underexplored: how token-level policy gradient terms are aggregated within each sampled group. Standard GRPO uses sequence aggregation, while recent work has advocated token aggregation as a better alternative. We show that these two rules induce different optimization biases: token aggregation introduces sign-length coupling, while sequence aggregation implicitly downweights longer responses through sequence-level equal weighting. To address this tension, we propose Balanced Aggregation (BA), a simple drop-in replacement that computes token-level means separately within the positive and negative subsets and then combines them with sequence-count-based w
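The three aggregation rules named in the abstract can be contrasted in a small sketch. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, `terms[i][t]` stands for the already-computed token-level policy gradient term of token t in response i, and every response is assumed non-empty. The abstract is cut off before the BA weights are specified, so the weighting used here (each subset's share of the sequences with nonzero advantage) is an assumption.

```python
from typing import List, Sequence


def sequence_aggregation(terms: Sequence[Sequence[float]]) -> float:
    """Average each response over its own tokens, then average over responses.

    Every response gets equal weight, so each token of a long response
    counts for less than each token of a short one.
    """
    per_sequence = [sum(seq) / len(seq) for seq in terms]
    return sum(per_sequence) / len(per_sequence)


def token_aggregation(terms: Sequence[Sequence[float]]) -> float:
    """Average over all tokens in the group, ignoring response boundaries."""
    total = sum(sum(seq) for seq in terms)
    count = sum(len(seq) for seq in terms)
    return total / count


def _token_mean(seqs: List[Sequence[float]]) -> float:
    """Token-level mean over a subset of responses (0.0 if the subset is empty)."""
    count = sum(len(seq) for seq in seqs)
    return sum(sum(seq) for seq in seqs) / count if count else 0.0


def balanced_aggregation(
    terms: Sequence[Sequence[float]], advantages: Sequence[float]
) -> float:
    """Token-level means within the positive and negative subsets, then combined.

    ASSUMPTION: the combination weights are n_pos / n and n_neg / n, where n
    counts responses with nonzero advantage. The source text is truncated
    before it defines the weights, so this choice is illustrative only.
    """
    pos = [seq for seq, adv in zip(terms, advantages) if adv > 0]
    neg = [seq for seq, adv in zip(terms, advantages) if adv < 0]
    n = len(pos) + len(neg)
    if n == 0:
        return 0.0
    return (len(pos) / n) * _token_mean(pos) + (len(neg) / n) * _token_mean(neg)


if __name__ == "__main__":
    # Two positive responses (lengths 2 and 4) and one long negative response.
    terms = [[2.0, 2.0], [1.0, 1.0, 1.0, 1.0], [-1.0] * 6]
    advantages = [1.0, 1.0, -1.0]
    print("sequence:", sequence_aggregation(terms))
    print("token:   ", token_aggregation(terms))
    print("balanced:", balanced_aggregation(terms, advantages))
```

On this toy group the three rules give different values. Under token aggregation the single long negative response contributes half of all tokens, which is one way the coupling between sign and length described in the abstract can show up.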
Related papers
- EP-GRPO: Entropy-progress Aligned Group Relative Policy Optimization With Implicit Process Guidance (2026)
- NGRPO: Negative-enhanced Group Relative Policy Optimization (2025)
- Noise-corrected GRPO: From Noisy Rewards To Unbiased Gradients (2025)
- MMR-GRPO: Accelerating GRPO-style Training Through Diversity-aware Reward Reweighting (2026)
- Reinforcement Learning With Verifiable Rewards: GRPO's Effective Loss, Dynamics, And Success Amplification (2025)
- TIC-GRPO: Provable And Efficient Optimization For Reinforcement Learning From Human Feedback (2025)
- Stepwise Guided Policy Optimization: Coloring Your Incorrect Reasoning In GRPO (2025)
- Group-relative REINFORCE Is Secretly An Off-policy Algorithm: Demystifying Some Myths About GRPO And Its Friends (2025)