Trust Region Masking For Long-horizon LLM Reinforcement Learning
2025 · Yingru Li, Jiacai Liu, Jiawei Xu, et al.
Abstract
Policy gradient methods for Large Language Models optimize a policy \(\pi_\theta\) via a surrogate objective computed from samples of a rollout policy \(\pi_{\text{roll}}\). However, modern LLM-RL pipelines suffer from unavoidable implementation divergences -- backend discrepancies, Mixture-of-Experts routing discontinuities, and distributed training staleness -- causing off-policy mismatch (\(\pi_{\text{roll}} \neq \pi_\theta\)) and approximation errors between the surrogate and the true objective. We demonstrate that classical trust region bounds on this error scale as \(O(T^2)\) with sequence length \(T\), rendering them vacuous for long-horizon tasks. To address this, we derive a family of bounds -- both KL-based and TV-based -- including a Pinsker-Marginal bound (\(O(T^{3/2})\)), a Mixed bound (\(O(T)\)), and an Adaptive bound that strictly generalizes the Pinsker-Marginal bound via per-position importance-ratio decomposition. Taking the minimum over all bounds yields th
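To make the gap between the stated rates concrete, the following minimal sketch compares only the \(T\)-dependent factors named in the abstract. It is an illustration under assumptions, not the paper's bounds: all constants and the dependence on the per-token divergence are set to 1 because the abstract does not give them, and the function name `bound_orders` is hypothetical.

```python
def bound_orders(T: int) -> dict:
    """Return only the T-dependent growth factor of each bound named in the abstract.

    Assumption: constants and divergence terms are set to 1, since the
    abstract states only the asymptotic order in the sequence length T.
    """
    t = float(T)
    return {
        "classical": t ** 2,           # O(T^2)
        "pinsker_marginal": t ** 1.5,  # O(T^{3/2})
        "mixed": t,                    # O(T)
    }


if __name__ == "__main__":
    for T in (16, 256, 4096):
        orders = bound_orders(T)
        # How many times larger the classical factor is than the linear one.
        gap = orders["classical"] / orders["mixed"]
        print(f"T={T}: {orders}, classical/mixed = {gap:.0f}")
```

Under these assumptions the classical factor exceeds the linear one by a factor of \(T\) itself, which is why the quadratic rate degrades fastest as sequences get longer.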
Related papers
- It's Not You, It's Clipping: A Soft Trust-region Via Probability Smoothing For LLM RL (2025)
- Stabilizing Off-policy Training For Long-horizon LLM Agent Via Turn-level Importance Sampling And Clipping-triggered Normalization (2025)
- Stabilizing Policy Gradients For Sample-efficient Reinforcement Learning In LLM Reasoning (2025)
- Trust Region Policy Optimisation In Multi-agent Reinforcement Learning (2021)
- Pairwise Proximal Policy Optimization: Harnessing Relative Feedback For LLM Alignment (2023)
- SPG: Sandwiched Policy Gradient For Masked Diffusion Language Models (2025)
- Inpainting-guided Policy Optimization For Diffusion Large Language Models (2025)
- Adaptive Layerwise Perturbation: Unifying Off-policy Corrections For LLM RL (2026)