Balanced Aggregation: Understanding And Fixing Aggregation Bias In GRPO

Abstract

arXiv:2605.04077v1

Reinforcement learning with verifiable rewards (RLVR) has become a central paradigm for improving reasoning and code generation in large language models, and GRPO-style training is widely adopted for its simplicity and effectiveness. However, an important design choice remains underexplored: how token-level policy gradient terms are aggregated within each sampled group. Standard GRPO uses sequence aggregation, while recent work has advocated token aggregation as a better alternative. We show that these two rules induce different optimization biases: token aggregation introduces sign-length coupling, while sequence aggregation implicitly downweights longer responses through sequence-level equal weighting. To address this tension, we propose Balanced Aggregation (BA), a simple drop-in replacement that computes token-level means separately within the positive and negative subsets and then combines them with sequence-count-based w
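The three aggregation rules named in the abstract can be contrasted on a toy group. The following is a minimal sketch, not the paper's implementation. The function names, the toy data, and the use of NumPy are illustrative assumptions. In particular, the abstract is cut off before it defines the combination weights, so the choice below (each subset's token-level mean weighted by that subset's share of the sequences in the group) is only one plausible reading.

```python
import numpy as np


def sequence_aggregate(token_terms):
    """Sequence aggregation: average each response's tokens, then average responses."""
    return float(np.mean([np.mean(t) for t in token_terms]))


def token_aggregate(token_terms):
    """Token aggregation: pool every token in the group and take one mean."""
    return float(np.mean(np.concatenate(token_terms)))


def balanced_aggregate(token_terms, advantages):
    """Token-level means within the positive and negative subsets, then combined.

    ASSUMPTION: each subset is weighted by (sequences in subset) / (sequences
    in group). The source text is truncated before it states the weights.
    """
    n = len(token_terms)
    total = 0.0
    for in_subset in (lambda a: a > 0, lambda a: a < 0):
        subset = [t for t, a in zip(token_terms, advantages) if in_subset(a)]
        if subset:  # skip an empty subset instead of averaging nothing
            total += (len(subset) / n) * float(np.mean(np.concatenate(subset)))
    return total


# Hypothetical group: one short positive response, one long negative response.
# Each per-token term is simply the response's advantage (importance ratio of 1).
adv = [1.0, -1.0]
terms = [np.full(2, 1.0), np.full(8, -1.0)]

seq_value = sequence_aggregate(terms)
tok_value = token_aggregate(terms)
bal_value = balanced_aggregate(terms, adv)
```

In this toy group the negative response is four times longer than the positive one, so pooling all tokens lets the negative sign dominate the result, which illustrates the sign-length coupling the abstract attributes to token aggregation. Sequence aggregation gives both responses equal weight regardless of length.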
