Advantage Shaping As Surrogate Reward Maximization: Unifying Pass@k Policy Gradients
2025 · Christos Thrampoulidis, Sadegh Mahdavi, Wenlong Deng
Abstract
This note reconciles two seemingly distinct approaches to policy gradient optimization for the Pass@K objective in reinforcement learning with verifiable rewards: (1) direct REINFORCE-style methods, and (2) advantage-shaping techniques that directly modify GRPO. We show that these are two sides of the same coin. By reverse-engineering existing advantage-shaping algorithms, we reveal that they implicitly optimize surrogate rewards. We specifically interpret practical "hard-example up-weighting" modifications to GRPO as reward-level regularization. Conversely, starting from surrogate reward objectives, we provide a simple recipe for deriving both existing and new advantage-shaping methods. This perspective provides a lens for RLVR policy gradient optimization beyond our original motivation of Pass@K.
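For readers unfamiliar with the two quantities the abstract refers to, the sketch below computes them for a single prompt with binary verifiable rewards. It is a minimal illustration, not the paper's method: `pass_at_k` is the standard combinatorial unbiased Pass@K estimator, and `grpo_advantages` is the usual group-normalized advantage, assumed here to use the population standard deviation with a small `eps` for stability. Both function names and the `eps` parameter are hypothetical choices made for this example, and none of the paper's surrogate rewards or shaped advantages are reproduced.

```python
from math import comb
from statistics import mean, pstdev


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@K estimate from n sampled responses, c of them correct.

    Equals 1 - C(n - c, k) / C(n, k): the probability that a uniformly
    chosen size-k subset of the n samples contains at least one correct one.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def grpo_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Group-normalized advantages: (r_i - mean) / (std + eps).

    Assumption: population standard deviation; implementations differ on
    this detail and on how eps is applied.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    rewards = [1.0, 0.0, 0.0, 0.0]  # one correct response out of four
    print(pass_at_k(n=4, c=1, k=2))
    print(grpo_advantages(rewards))
```

With one correct response out of four, the correct sample receives a positive advantage and the incorrect ones a negative advantage. When all rewards in a group are identical, the advantages in this sketch are all zero, which is the regime where hard-example up-weighting modifications become relevant.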
Related papers
- Residual Policy Gradient: A Reward View Of KL-Regularized Objective (2025)
- Improving Policy Gradient By Exploring Under-appreciated Rewards (2016)
- Reinforcement Learning With Verifiable Rewards: GRPO's Effective Loss, Dynamics, And Success Amplification (2025)
- Proximal Policy Optimization Algorithms (2017)
- Rethinking KL Regularization In RLHF: From Value Estimation To Gradient Optimization (2025)
- ORSO: Accelerating Reward Design Via Online Reward Selection And Policy Optimization (2024)
- A General Class Of Surrogate Functions For Stable And Efficient Reinforcement Learning (2021)
- Policy Gradient Algorithms Implicitly Optimize By Continuation (2023)