Abstract

Policy gradient methods for Large Language Models optimize a policy \(\pi_\theta\) via a surrogate objective computed from samples of a rollout policy \(\pi_{\text{roll}}\). However, modern LLM-RL pipelines suffer from unavoidable implementation divergences -- backend discrepancies, Mixture-of-Experts routing discontinuities, and distributed training staleness -- causing off-policy mismatch (\(\pi_{\text{roll}} \neq \pi_\theta\)) and approximation errors between the surrogate and the true objective. We demonstrate that classical trust region bounds on this error scale as \(O(T^2)\) with sequence length \(T\), rendering them vacuous for long-horizon tasks. To address this, we derive a family of bounds -- both KL-based and TV-based -- including a Pinsker-Marginal bound (\(O(T^{3/2})\)), a Mixed bound (\(O(T)\)), and an Adaptive bound that strictly generalizes the Pinsker-Marginal bound via per-position importance-ratio decomposition. Taking the minimum over all bounds yields th

Authors

(none)

Tags

  • Policy Gradient

Stats

  • citations: 0
  • S2 citations: n/a
  • github stars: 0
  • HF likes: 0
  • heat score: 0.00
  • arxiv key: li2025trust

Related papers