Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) replaces costly human labeling with automated verifiers. To reduce verifier hacking, many RLVR systems binarize rewards to \(\{0,1\}\), but imperfect verifiers inevitably introduce *false negatives* (rejecting correct answers) and *false positives* (accepting incorrect ones). We formalize verifier unreliability as a stochastic reward channel with asymmetric noise rates \(\rho_0\) and \(\rho_1\), the false-positive (FP) rate and the false-negative (FN) rate, respectively. From this abstraction we derive two lightweight corrections: (i) a *backward* correction that yields an unbiased surrogate reward and thus an unbiased policy-gradient estimator in expectation, and (ii) a *forward* correction that reweights score-function terms so that the expected update aligns with the clean gradient direction, and that requires only the FN rate. We implement both as lightweight hooks in a group relative policy optimization (GRPO) pipeline. Both corrections improve RLVR for math reasoning under
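To make the backward correction concrete, here is a minimal sketch, not the paper's implementation. It assumes the standard noisy-label construction: if the verifier reports \(\tilde r \in \{0,1\}\) with \(P(\tilde r = 1 \mid r = 0) = \rho_0\) and \(P(\tilde r = 0 \mid r = 1) = \rho_1\), then \(\mathbb{E}[\tilde r \mid r] = \rho_0 + (1 - \rho_0 - \rho_1)\,r\), so \(\hat r = (\tilde r - \rho_0)/(1 - \rho_0 - \rho_1)\) is unbiased for the clean reward \(r\). The function names are hypothetical, and the sketch further assumes that both rates are known and that \(\rho_0 + \rho_1 < 1\). The forward correction is not sketched, because the abstract does not give its weights.

```python
def backward_corrected_reward(observed: int, rho0: float, rho1: float) -> float:
    """Surrogate reward that is unbiased for the clean reward (hypothetical helper).

    observed: binary verifier output (0 or 1)
    rho0:     assumed false-positive rate, P(observed = 1 | clean = 0)
    rho1:     assumed false-negative rate, P(observed = 0 | clean = 1)
    """
    denom = 1.0 - rho0 - rho1
    if denom <= 0.0:
        # The channel carries no usable signal when rho0 + rho1 >= 1.
        raise ValueError("requires rho0 + rho1 < 1")
    return (observed - rho0) / denom


def expected_surrogate(clean: int, rho0: float, rho1: float) -> float:
    """Exact expectation of the surrogate reward given the clean reward."""
    # Probability that the noisy verifier accepts, given the clean label.
    p_accept = (1.0 - rho1) if clean == 1 else rho0
    return (
        p_accept * backward_corrected_reward(1, rho0, rho1)
        + (1.0 - p_accept) * backward_corrected_reward(0, rho0, rho1)
    )


if __name__ == "__main__":
    # Illustrative noise rates only; not taken from the paper.
    rho0, rho1 = 0.1, 0.2
    for clean in (0, 1):
        print(clean, expected_surrogate(clean, rho0, rho1))
```

In a GRPO-style pipeline, a surrogate of this kind would stand in for the raw verifier output before group-relative advantages are computed. The price is higher variance, since the surrogate is scaled by \(1/(1 - \rho_0 - \rho_1)\).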
