Polychromic Objectives For Reinforcement Learning
2026 · Jubayer Ibn Hamid, Ifdita Hasan Orney, Ellen Xu, et al.
Abstract
arXiv:2509.25424v5
Reinforcement learning fine-tuning (RLFT) is a dominant paradigm for improving pretrained policies for downstream tasks. These pretrained policies, trained on large datasets, produce generations with a broad range of promising but unrefined behaviors. Often, a critical failure mode of RLFT arises when policies lose this diversity and collapse into a handful of easily exploitable outputs. This convergence hinders exploration, which is essential for expanding the capabilities of the pretrained policy and for amplifying the benefits of test-time compute scaling. To address this, we introduce an objective for policy gradient methods that explicitly enforces the exploration and refinement of diverse generations, which we call a polychromic objective. We then show how proximal policy optimization (PPO) can be adapted to optimize this objective. Our method (1) employs vine sampling to collect on-policy rollouts and (2) modifies the ad
Related papers
- Proximal Policy Optimization Algorithms (2017) 0.00
- Think Outside The Policy: In-context Steered Policy Optimization (2025) 0.00
- Near-future Policy Optimization (2026) 0.00
- ANO: A Principled Approach To Robust Policy Optimization (2026) 0.00
- Uniform-correct Policy Optimization: Breaking RLVR's Indifference To Diversity (2026) 0.00
- Policy Improvement Reinforcement Learning (2026) 0.00
- DVPO: Distributional Value Modeling-based Policy Optimization For LLM Post-training (2026) 0.00
- Unified Policy Optimization For Continuous-action Reinforcement Learning In Non-stationary Tasks And Games (2022) 2.26