The Implicit Curriculum: Learning Dynamics In RL With Verifiable Rewards
2026 · Yu Huang, Zixin Wen, Yuejie Chi, et al.
Abstract
arXiv:2602.14872v2
Reinforcement learning with verifiable rewards (RLVR) has been a main driver of recent breakthroughs in large reasoning models. Yet it remains a mystery how rewards based solely on final outcomes can help overcome the long-horizon barrier to extended reasoning. To understand this, we develop a theory of the training dynamics of RLVR for transformers on compositional reasoning tasks. Our theory shows that mixed-difficulty training naturally follows an implicit curriculum: without any explicit schedule, easier problems become learnable first and shape the frontier for harder ones, creating a learning progression from easy to hard during optimization. The effectiveness of this curriculum is governed by the smoothness of the difficulty spectrum. When the spectrum is smooth, training dynamics enter a well-behaved relay regime, in which persistent gradient signals on easier problems make slightly harder ones tractable and keep training at
Related papers
- Rate Or Fate? RLV\(^\varepsilon\)R: Reinforcement Learning With Verifiable Noisy Rewards (2026)
- Policy Improvement Reinforcement Learning (2026)
- On The Optimization Dynamics Of RLVR: Gradient Gap And Step Size Thresholds (2025)
- Reinforcement Learning With Verifiable Yet Noisy Rewards Under Imperfect Verifiers (2025)
- Rethinking Entropy Interventions In RLVR: An Entropy Change Perspective (2026)
- Delay, Plateau, Or Collapse: Evaluating The Impact Of Systematic Verification Error On RLVR (2026)
- No Prompt Left Behind: Exploiting Zero-variance Prompts In LLM Reinforcement Learning Via Entropy-guided Advantage Shaping (2025)
- A Relative-budget Theory For Reinforcement Learning With Verifiable Rewards In Large Language Model Reasoning (2026)