Uniform-correct Policy Optimization: Breaking Rlvr's Indifference To Diversity
2026 Β· Anamika Lochab, Bolian Li, Ruqi Zhang
Abstract
arXiv:2605.00365v1 Announce Type: new Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has achieved substantial gains in single-attempt accuracy (Pass@1) on reasoning tasks, yet often suffers from reduced multi-sample coverage (Pass@K), indicating diversity collapse. We identify a structural cause for this degradation: common RLVR objectives, such as GRPO, are indifferent to how probability mass is distributed among correct solutions. Combined with stochastic training dynamics, this indifference induces a self-reinforcing collapse, in which probability mass concentrates on a narrow subset of correct outputs while alternative valid solutions are suppressed. We formalize this collapse mechanism and further characterize the optimal policy structure under two complementary criteria: robustness and entropy-regularized optimality, which identify the Uniform-Correct Policy as uniquely optimal. Motivated by this analysis, we propose Uniform-Correct Policy Optimization (UCPO), a
Authors
(none)
Tags
Stats
Related papers
- EP-GRPO: Entropy-progress Aligned Group Relative Policy Optimization With Implicit Process Guidance (2026)0.00
- Policy Improvement Reinforcement Learning (2026)0.00
- DVPO: Distributional Value Modeling-based Policy Optimization For LLM Post-training (2026)0.00
- Arbitrary Entropy Policy Optimization Breaks The Exploration Bottleneck Of Reinforcement Learning (2025)0.00
- DGPO: Discovering Multiple Strategies With Diversity-guided Policy Optimization (2022)2.26
- A Unified Framework For Rethinking Policy Divergence Measures In GRPO (2026)0.00
- Think Outside The Policy: In-context Steered Policy Optimization (2025)0.00
- Polychromic Objectives For Reinforcement Learning (2026)0.00