Near-future Policy Optimization

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a core post-training recipe. Introducing suitable off-policy trajectories into on-policy exploration accelerates RLVR convergence and raises the performance ceiling, yet finding a source of such trajectories remains the key challenge. Existing mixed-policy methods either import trajectories from external teachers (high-quality but distributionally far) or replay past training trajectories (close but capped in quality); neither simultaneously satisfies the "strong enough" (higher \(Q\), more new knowledge to learn) and "close enough" (lower \(V\), more readily absorbed) conditions required to maximize the effective learning signal \(\mathcal{S} = Q/V\). We propose Near-Future Policy Optimization (NPO), a simple mixed-policy scheme that learns from a policy's own near-future self: a later checkpoint from the same training run is a natural source of auxiliary trajectories that
