Policy Improvement Reinforcement Learning
2026 · Huaiyang Wang, Xiaojie Li, Deqing Wang, et al.
Abstract
arXiv:2604.00860v2
Reinforcement Learning with Verifiable Rewards (RLVR) has become a central post-training paradigm for improving the reasoning capabilities of large language models. Yet existing methods share a common blind spot: they optimize policies based on instantaneous group-level or batch-level statistics without ever verifying whether the resulting update actually improved the model. This open-loop design -- updating in isolation at each step, guided only by within-group (batch) reward signals -- means optimization can drift or collapse with no mechanism to detect and correct these failures. We argue that the missing ingredient is policy improvement feedback: the ability to measure and optimize inter-iteration progress directly. To this end, we introduce Policy Improvement Reinforcement Learning (PIRL), a framework that replaces surrogate reward maximization with the explicit objective of maximizing cumulative policy improvement across iterat
Related papers
- Reinforcement Learning With Verifiable Yet Noisy Rewards Under Imperfect Verifiers (2025)
- Near-future Policy Optimization (2026)
- Adapt To Thrive! Adaptive Power-mean Policy Optimization For Improved LLM Reasoning (2026)
- Think Outside The Policy: In-context Steered Policy Optimization (2025)
- The Implicit Curriculum: Learning Dynamics In RL With Verifiable Rewards (2026)
- When Errors Can Be Beneficial: A Categorization Of Imperfect Rewards For Policy Gradient (2026)
- GHPO: Adaptive Guidance For Stable And Efficient LLM Reinforcement Learning (2025)
- Uniform-correct Policy Optimization: Breaking RLVR's Indifference To Diversity (2026)