Near-future Policy Optimization
2026 · Chuanyu Qin, Chenxu Yang, Qingyi Si, et al.
Abstract
Reinforcement learning with verifiable rewards (RLVR) has become a core post-training recipe. Introducing suitable off-policy trajectories into on-policy exploration accelerates RLVR convergence and raises the performance ceiling, yet finding a source of such trajectories remains the key challenge. Existing mixed-policy methods either import trajectories from external teachers (high-quality but distributionally far) or replay past training trajectories (close but capped in quality). Neither satisfies both conditions required to maximize the effective learning signal \(\mathcal{S} = Q/V\): being strong enough (higher \(Q\), more new knowledge to learn) and close enough (lower \(V\), more readily absorbed). We propose Near-Future Policy Optimization (NPO), a simple mixed-policy scheme that learns from a policy's own near-future self: a later checkpoint from the same training run is a natural source of auxiliary trajectories that
Related papers
- Think Outside The Policy: In-context Steered Policy Optimization (2025)
- Policy Improvement Reinforcement Learning (2026)
- Learning Self-imitating Diverse Policies (2018)
- Relative Entropy Pathwise Policy Optimization (2025)
- PTR-PPO: Proximal Policy Optimization With Prioritized Trajectory Replay (2021)
- Exploration-driven Policy Optimization In RLHF: Theoretical Insights On Efficient Data Utilization (2024)
- Provably Efficient Exploration In Policy Optimization (2019)
- Reward-conditioned Policies (2019)