Reinforcement Learning With Verifiable Yet Noisy Rewards Under Imperfect Verifiers
2025 · Xin-Qiang Cai, Wei Wang, Feng Liu, et al.
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) replaces costly human labeling with automated verifiers. To reduce verifier hacking, many RLVR systems binarize rewards to \(\{0,1\}\), but imperfect verifiers inevitably introduce *false negatives* (rejecting correct answers) and *false positives* (accepting incorrect ones). We formalize verifier unreliability as a stochastic reward channel with asymmetric noise rates \(\rho_0\) and \(\rho_1\), the false-positive (FP) rate and the false-negative (FN) rate, respectively. From this abstraction we derive two lightweight corrections: (i) a *backward* correction that yields an unbiased surrogate reward and thus an unbiased policy-gradient estimator in expectation, and (ii) a *forward* correction that reweights score-function terms so that the expected update aligns with the clean gradient direction, and that requires only the FN rate. We implement both as lightweight hooks in a group relative policy optimization pipeline; both corrections improve RLVR for math reasoning under
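As an illustration of the backward correction, the following minimal sketch works directly from the channel described in the abstract. It assumes \(\rho_0 = P(\tilde r = 1 \mid r = 0)\) and \(\rho_1 = P(\tilde r = 0 \mid r = 1)\), where \(\tilde r\) is the observed reward and \(r\) the clean one, so that \(E[\tilde r \mid r] = \rho_0 + (1 - \rho_0 - \rho_1)\, r\). Inverting this affine map gives the surrogate \((\tilde r - \rho_0)/(1 - \rho_0 - \rho_1)\). The function names are hypothetical, and the exact form used by the authors may differ.

```python
def backward_corrected_reward(observed: float, rho0: float, rho1: float) -> float:
    """Surrogate reward whose expectation equals the clean binary reward.

    Assumed channel (not taken verbatim from the paper):
      rho0 = P(observed = 1 | clean = 0)  -- false-positive rate
      rho1 = P(observed = 0 | clean = 1)  -- false-negative rate
    """
    denom = 1.0 - rho0 - rho1
    if denom <= 0.0:
        # The verifier carries no usable signal when rho0 + rho1 >= 1.
        raise ValueError("requires rho0 + rho1 < 1")
    return (observed - rho0) / denom


def expected_corrected_reward(clean: int, rho0: float, rho1: float) -> float:
    """Exact expectation of the surrogate over the noisy channel."""
    # Probability that the verifier accepts, given the clean label.
    p_accept = (1.0 - rho1) if clean == 1 else rho0
    return (p_accept * backward_corrected_reward(1.0, rho0, rho1)
            + (1.0 - p_accept) * backward_corrected_reward(0.0, rho0, rho1))


if __name__ == "__main__":
    for clean in (0, 1):
        print(clean, expected_corrected_reward(clean, rho0=0.1, rho1=0.2))
```

Under these assumptions the surrogate takes values outside \(\{0,1\}\), including negative ones for rejected samples. That is the price of unbiasedness, and it also means both noise rates must be known or estimated.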
Related papers
- Rate Or Fate? RLV\(^\varepsilon\)R: Reinforcement Learning With Verifiable Noisy Rewards (2026)
- Delay, Plateau, Or Collapse: Evaluating The Impact Of Systematic Verification Error On RLVR (2026)
- Policy Improvement Reinforcement Learning (2026)
- On The Optimization Dynamics Of RLVR: Gradient Gap And Step Size Thresholds (2025)
- The Implicit Curriculum: Learning Dynamics In RL With Verifiable Rewards (2026)
- No Prompt Left Behind: Exploiting Zero-variance Prompts In LLM Reinforcement Learning Via Entropy-guided Advantage Shaping (2025)
- Shrinking The Variance: Shrinkage Baselines For Reinforcement Learning With Verifiable Rewards (2025)
- Rethinking Entropy Interventions In RLVR: An Entropy Change Perspective (2026)