Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a core post-training recipe. Introducing suitable off-policy trajectories into on-policy exploration accelerates RLVR convergence and raises the performance ceiling, yet finding a source of such trajectories remains the key challenge. Existing mixed-policy methods either import trajectories from external teachers (high-quality but distributionally far) or replay past training trajectories (close but capped in quality). Neither simultaneously satisfies the two conditions required to maximize the effective learning signal \(\mathcal{S} = Q/V\): being strong enough (higher \(Q\), more new knowledge to learn) and close enough (lower \(V\), more readily absorbed). We propose Near-Future Policy Optimization (NPO), a simple mixed-policy scheme that learns from a policy's own near-future self: a later checkpoint from the same training run is a natural source of auxiliary trajectories that

Tags

  • Exploration

Stats

  • citations: 0
  • S2 citations: n/a
  • GitHub stars: 0
  • HF likes: 0
  • heat score: 0.00
  • arXiv key: qin2026near

Related papers