Rate Or Fate? RLV\(^\varepsilon\)R: Reinforcement Learning With Verifiable Noisy Rewards
2026 · Ali Rad, Khashayar Filom, Darioush Keivan, et al.
Abstract
Reinforcement learning with verifiable rewards (RLVR) is a simple but powerful paradigm for training LLMs: sample a completion, verify it, and update. In practice, however, the verifier is almost never clean: unit tests probe only limited corner cases, human and synthetic labels are imperfect, and LLM judges (e.g., RLAIF) are noisy and can be exploited. The problem worsens on harder domains (especially coding), where tests are sparse and increasingly model-generated. We ask a pragmatic question: does verification noise merely slow down learning (rate), or can it flip the outcome (fate)? To address this, we develop an analytically tractable multi-armed bandit view of RLVR dynamics, instantiated with GRPO and validated in controlled experiments. Modeling false positives and false negatives and grouping completions into recurring reasoning modes yields a replicator-style (natural-selection) flow on the probability simplex. The dynamics decouple into within-correct-mode comp
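As a rough illustration of the replicator-style flow the abstract mentions, the sketch below integrates the textbook replicator equation dp_i/dt = p_i (r_i - r̄) over a handful of reasoning modes, where r_i is the mean reward a noisy binary verifier gives to mode i. This is a simplified stand-in under stated assumptions, not the paper's GRPO-based derivation: the function names, the Euler step, and the parameters `fnr` (false-negative rate) and `fpr` (false-positive rate) are all hypothetical choices made for the example.

```python
def expected_noisy_reward(is_correct, fnr, fpr):
    """Mean reward from a noisy binary verifier (assumed model).

    A correct completion passes with probability 1 - fnr;
    an incorrect one passes with probability fpr.
    """
    return 1.0 - fnr if is_correct else fpr


def replicator_flow(p0, correct, fnr, fpr, dt=0.05, steps=2000):
    """Euler-integrate dp_i/dt = p_i * (r_i - rbar) on the simplex.

    p0      : initial probabilities of each reasoning mode (sums to 1)
    correct : booleans marking which modes are correct
    """
    p = list(p0)
    r = [expected_noisy_reward(c, fnr, fpr) for c in correct]
    for _ in range(steps):
        rbar = sum(pi * ri for pi, ri in zip(p, r))  # population-average reward
        p = [pi + dt * pi * (ri - rbar) for pi, ri in zip(p, r)]
        total = sum(p)  # renormalize to guard against floating-point drift
        p = [pi / total for pi in p]
    return p


def correct_mass(p, correct):
    """Total probability placed on correct modes."""
    return sum(pi for pi, c in zip(p, correct) if c)


if __name__ == "__main__":
    modes = [True, True, False, False]  # two correct, two incorrect modes
    start = [0.1, 0.2, 0.3, 0.4]        # correct mass starts at 0.3
    for fnr, fpr in [(0.1, 0.2), (0.5, 0.5), (0.7, 0.8)]:
        final = replicator_flow(start, modes, fnr, fpr)
        print(f"fnr={fnr}, fpr={fpr}: correct mass {correct_mass(final, modes):.4f}")
```

In this toy model, correct mass grows when 1 - fnr exceeds fpr, stays put when the two are equal, and shrinks otherwise, while the size of the gap sets how quickly it moves. That gives one concrete reading of the rate-versus-fate question. It is a property of the sketch and should not be read as the paper's stated result.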
Related papers
- Reinforcement Learning With Verifiable Yet Noisy Rewards Under Imperfect Verifiers (2025)
- Delay, Plateau, Or Collapse: Evaluating The Impact Of Systematic Verification Error On RLVR (2026)
- The Implicit Curriculum: Learning Dynamics In RL With Verifiable Rewards (2026)
- On The Optimization Dynamics Of RLVR: Gradient Gap And Step Size Thresholds (2025)
- Rethinking Entropy Interventions In RLVR: An Entropy Change Perspective (2026)
- Policy Improvement Reinforcement Learning (2026)
- No Prompt Left Behind: Exploiting Zero-variance Prompts In LLM Reinforcement Learning Via Entropy-guided Advantage Shaping (2025)
- Shrinking The Variance: Shrinkage Baselines For Reinforcement Learning With Verifiable Rewards (2025)