EP-GRPO: Entropy-Progress Aligned Group Relative Policy Optimization With Implicit Process Guidance
2026 · Song Yu, Li Li, Wenwen Zhao, et al.
Abstract
arXiv:2605.04960v1
Reinforcement learning with verifiable rewards (RLVR), particularly Group Relative Policy Optimization (GRPO), has advanced LLM reasoning. However, GRPO suffers from three credit assignment failures: uniform token-level granularity that ignores heterogeneous informational value, uniform polarity that penalizes correct steps and rewards incorrect ones, and zero-variance collapse that erases outcome-driven gradients. We systematically quantify these failures, revealing highly non-uniform token informativeness, widespread step-level polarity misalignment, and substantial training waste. To address these limitations, we propose Entropy-Progress Aligned GRPO (EP-GRPO), a framework that mines the model's intrinsic information flow for dense, self-supervised guidance. EP-GRPO integrates entropy-gated modulation to prioritize high-entropy decision pivots, implicit process signals from policy divergence anchored to outcome advantages for direct
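The credit assignment failures named in the abstract can be illustrated with a minimal sketch. It assumes the commonly used GRPO normalization (reward minus group mean, divided by group standard deviation) and shows how a group with identical rewards yields all-zero advantages. The `entropy_gated` helper is a hypothetical illustration of weighting tokens by entropy; it is not the paper's formula, which this excerpt does not give.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return sum(-p * math.log(p) for p in probs if p > 0)


def entropy_gated(advantage, entropies):
    """Hypothetical gate (an assumption, not the paper's method):
    scale one sequence-level advantage per token by entropy / mean entropy."""
    avg = mean(entropies)
    if avg == 0:
        return [advantage] * len(entropies)
    return [advantage * h / avg for h in entropies]


# Mixed outcomes: correct and incorrect responses get opposite-signed advantages,
# but every token inside a response shares the same value (uniform granularity).
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Zero-variance collapse: all responses receive the same reward, so every
# advantage is zero and the group contributes no outcome-driven gradient.
collapsed = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

# A near-uniform token distribution (a decision pivot) versus a deterministic one.
entropies = [token_entropy([0.5, 0.5]), token_entropy([1.0, 0.0])]
gated = entropy_gated(1.0, entropies)
```

Under this toy gate, the deterministic token receives zero weight while the uncertain token receives the full share, which is one simple way to express the idea of prioritizing high-entropy decision pivots.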
Related papers
- Arbitrary Entropy Policy Optimization Breaks The Exploration Bottleneck Of Reinforcement Learning (2025)
- TIC-GRPO: Provable And Efficient Optimization For Reinforcement Learning From Human Feedback (2025)
- NGRPO: Negative-enhanced Group Relative Policy Optimization (2025)
- AEGPO: Adaptive Entropy-guided Policy Optimization For Diffusion Models (2026)
- Reinforcement Learning With Verifiable Rewards: GRPO's Effective Loss, Dynamics, And Success Amplification (2025)
- Stepwise Guided Policy Optimization: Coloring Your Incorrect Reasoning In GRPO (2025)
- Noise-corrected GRPO: From Noisy Rewards To Unbiased Gradients (2025)
- Balanced Aggregation: Understanding And Fixing Aggregation Bias In GRPO (2026)