Trust Region Masking For Long-horizon LLM Reinforcement Learning

Abstract

Policy gradient methods for Large Language Models optimize a policy \(\pi_\theta\) via a surrogate objective computed from samples of a rollout policy \(\pi_{\text{roll}}\). However, modern LLM-RL pipelines suffer from unavoidable implementation divergences -- backend discrepancies, Mixture-of-Experts routing discontinuities, and distributed training staleness -- causing off-policy mismatch (\(\pi_{\text{roll}} \neq \pi_\theta\)) and approximation errors between the surrogate and the true objective. We demonstrate that classical trust region bounds on this error scale as \(O(T^2)\) with sequence length \(T\), rendering them vacuous for long-horizon tasks. To address this, we derive a family of bounds -- both KL-based and TV-based -- including a Pinsker-Marginal bound (\(O(T^{3/2})\)), a Mixed bound (\(O(T)\)), and an Adaptive bound that strictly generalizes the Pinsker-Marginal bound via per-position importance-ratio decomposition. Taking the minimum over all bounds yields th
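As a rough illustration of the quantities the abstract refers to, the sketch below computes the per-position KL divergence and total variation (TV) distance between a rollout distribution and a training distribution, and checks Pinsker's inequality, \(\mathrm{TV} \le \sqrt{\mathrm{KL}/2}\), which links the two. It also compares how the three stated orders grow with \(T\). The toy distributions, the function names, and the omission of all constants are assumptions made for illustration; this is not the paper's bound or method.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def tv_distance(p, q):
    """Total variation distance: half the L1 distance between p and q."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def growth_orders(T):
    """Raw growth terms only; the constants in any actual bound are omitted."""
    return {"T^2": T ** 2, "T^(3/2)": T ** 1.5, "T": T}

# Hypothetical next-token distributions at one position (toy 4-token vocabulary).
pi_roll = [0.70, 0.20, 0.05, 0.05]
pi_theta = [0.60, 0.25, 0.10, 0.05]

kl = kl_divergence(pi_roll, pi_theta)
tv = tv_distance(pi_roll, pi_theta)
pinsker_bound = math.sqrt(kl / 2.0)  # Pinsker: TV <= sqrt(KL / 2)

print(f"KL = {kl:.4f}, TV = {tv:.4f}, sqrt(KL/2) = {pinsker_bound:.4f}")
for T in (1024, 4096):
    print(T, growth_orders(T))
```

Even in this toy setting, the gap between the \(T^2\) term and the \(T\) term is a factor of \(T\) itself, which gives a sense of why a quadratic dependence becomes uninformative at long sequence lengths.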
