Shrinking The Variance: Shrinkage Baselines For Reinforcement Learning With Verifiable Rewards
2025 · Guanning Zeng, Zhaoyi Zhou, Daman Arora, et al.
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm for post-training large reasoning models (LRMs) using policy-gradient methods such as GRPO. To stabilize training, these methods typically center trajectory rewards by subtracting the empirical mean reward for each prompt. Statistically, this centering acts as a control variate (baseline), reducing the variance of the policy-gradient estimator. In practice, the mean reward is estimated using per-prompt empirical averages computed from the generations for each prompt in a batch. Motivated by Stein's paradox, we propose shrinkage estimators that combine per-prompt and across-prompt means to improve per-prompt mean estimation accuracy, especially in the low-generation regime typical of RLVR. Theoretically, we construct a shrinkage-based baseline that provably yields lower-variance policy-gradient estimators across algorithms. Our baseline is a drop-in replacement for standard per-prompt mean baselines
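To make the idea of combining per-prompt and across-prompt means concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the estimator derived in the paper: the function name `shrinkage_baseline`, the scalar weight `lam`, and the James-Stein-style plug-in rule used when `lam` is omitted are all hypothetical choices made for this example.

```python
import numpy as np


def shrinkage_baseline(rewards, lam=None):
    """Shrink each per-prompt mean reward toward the across-prompt mean.

    rewards: array of shape (num_prompts, num_generations).
    lam:     shrinkage weight in [0, 1]; 0 keeps the per-prompt mean,
             1 uses the across-prompt mean for every prompt.
             If None, a simple plug-in weight is estimated from the batch
             (an assumption of this sketch, not the paper's rule).
    """
    rewards = np.asarray(rewards, dtype=float)
    n_prompts, n_gens = rewards.shape
    prompt_means = rewards.mean(axis=1)   # standard per-prompt baseline
    grand_mean = prompt_means.mean()      # across-prompt mean

    if lam is None:
        # Estimated noise variance of a per-prompt mean (needs n_gens >= 2).
        within = rewards.var(axis=1, ddof=1).mean() / n_gens
        # Observed spread of per-prompt means (needs n_prompts >= 2).
        between = prompt_means.var(ddof=1)
        # Shrink more when the means are noisy relative to their spread.
        lam = 1.0 if between == 0 else min(1.0, within / between)

    return (1.0 - lam) * prompt_means + lam * grand_mean


# Toy batch: 3 prompts, 4 generations each, binary verifiable rewards.
rewards = np.array([[1, 0, 0, 0],
                    [1, 1, 1, 0],
                    [0, 0, 0, 0]])

baseline = shrinkage_baseline(rewards, lam=0.5)
advantages = rewards - baseline[:, None]  # centered rewards per generation
auto_baseline = shrinkage_baseline(rewards)  # data-driven weight
```

With few generations per prompt, each per-prompt mean is a noisy estimate, so pulling it toward the batch-wide mean trades a small bias for lower estimation variance. Setting `lam=0` recovers the usual per-prompt mean baseline.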
Related papers
- On The Optimization Dynamics Of RLVR: Gradient Gap And Step Size Thresholds (2025)
- No Prompt Left Behind: Exploiting Zero-variance Prompts In LLM Reinforcement Learning Via Entropy-guided Advantage Shaping (2025)
- \(V_{0.5}\): Generalist Value Model As A Prior For Sparse RL Rollouts (2026)
- Reinforcement Learning With Verifiable Yet Noisy Rewards Under Imperfect Verifiers (2025)
- Policy Improvement Reinforcement Learning (2026)
- A Unified Framework For Rethinking Policy Divergence Measures In GRPO (2026)
- Reinforcement Learning With Verifiable Rewards: GRPO's Effective Loss, Dynamics, And Success Amplification (2025)
- Variance Reduction For Policy-gradient Methods Via Empirical Variance Minimization (2022)