On The Optimization Dynamics Of RLVR: Gradient Gap And Step Size Thresholds
2025 · Joe Suk, Yaqi Duan
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR), which uses simple binary feedback to post-train large language models, has found significant empirical success. However, a principled understanding of why it works is lacking. This paper builds a theoretical foundation for RLVR by analyzing its training process at both the full-response (trajectory) and token levels. Central to our analysis is a new quantity called the Gradient Gap, which formalizes the direction of improvement from low-reward to high-reward regions of the response space. We prove that convergence critically depends on aligning the update direction with this Gradient Gap. Moreover, we derive a sharp step-size threshold based on the magnitude of the Gradient Gap: below it, learning converges, whereas above it, performance collapses. Our theory further predicts how the critical step size must scale with response length and the success rate, thereby explaining why practical heuristics such as length normalization impr
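To make the idea of a direction pointing from low-reward to high-reward responses concrete, here is a minimal sketch on a toy softmax bandit. It assumes, as an illustration not taken from the abstract, that the Gradient Gap is the difference between the mean score function over successful responses and the mean score function over failed responses. The names `gradient_gap`, `conditional_mean_score`, and the toy logits and rewards are all hypothetical. Under that assumption, the exact policy gradient of the success rate `q` equals `q * (1 - q)` times the gap, which the code checks numerically. The sketch does not attempt to reproduce the paper's step-size threshold.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def score(k, probs):
    # Gradient of log pi(k) with respect to the logits: one_hot(k) - probs.
    return [(1.0 if i == k else 0.0) - p for i, p in enumerate(probs)]

def conditional_mean_score(probs, rewards, target):
    # Mean score function over responses whose binary reward equals `target`.
    mass = sum(p for p, r in zip(probs, rewards) if r == target)
    mean = [0.0] * len(probs)
    for k, (p, r) in enumerate(zip(probs, rewards)):
        if r == target:
            s = score(k, probs)
            for i in range(len(probs)):
                mean[i] += (p / mass) * s[i]
    return mean

def gradient_gap(probs, rewards):
    # Assumed definition: mean score on successes minus mean score on failures.
    g_pos = conditional_mean_score(probs, rewards, 1)
    g_neg = conditional_mean_score(probs, rewards, 0)
    return [a - b for a, b in zip(g_pos, g_neg)]

def policy_gradient(probs, rewards):
    # Exact gradient of the expected reward E[r] with respect to the logits.
    grad = [0.0] * len(probs)
    for k, (p, r) in enumerate(zip(probs, rewards)):
        s = score(k, probs)
        for i in range(len(probs)):
            grad[i] += p * r * s[i]
    return grad

# Hypothetical toy problem: four candidate responses, two of them verified correct.
logits = [0.5, -1.0, 0.2, 1.3]
rewards = [1, 0, 0, 1]

probs = softmax(logits)
q = sum(p for p, r in zip(probs, rewards) if r == 1)  # current success rate
gap = gradient_gap(probs, rewards)
pg = policy_gradient(probs, rewards)
scaled_gap = [q * (1 - q) * g for g in gap]

print("success rate q:", q)
print("policy gradient:", pg)
print("q(1-q) * gap:   ", scaled_gap)
```

In this toy setting the two printed vectors agree, so the policy gradient is a rescaled copy of the gap, and the `q * (1 - q)` factor shrinks as the success rate approaches 0 or 1. That is one simple way to see why a step size might need to depend on the success rate.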
Related papers
- OBLR-PO: A Theoretical Framework For Stable Reinforcement Learning (2025)
- Shrinking The Variance: Shrinkage Baselines For Reinforcement Learning With Verifiable Rewards (2025)
- Rate Or Fate? RLV\(^\varepsilon\)R: Reinforcement Learning With Verifiable Noisy Rewards (2026)
- The Implicit Curriculum: Learning Dynamics In RL With Verifiable Rewards (2026)
- Reinforcement Learning With Verifiable Yet Noisy Rewards Under Imperfect Verifiers (2025)
- Delay, Plateau, Or Collapse: Evaluating The Impact Of Systematic Verification Error On RLVR (2026)
- No Prompt Left Behind: Exploiting Zero-variance Prompts In LLM Reinforcement Learning Via Entropy-guided Advantage Shaping (2025)
- Rethinking Entropy Interventions In RLVR: An Entropy Change Perspective (2026)