Stabilizing Off-policy Training For Long-horizon LLM Agent Via Turn-level Importance Sampling And Clipping-triggered Normalization
2025 · Chenliang Li, Adel Elmahdy, Alex Boyd, et al.
Abstract
Reinforcement learning (RL) algorithms such as PPO and GRPO are widely used to train large language models (LLMs) for multi-turn agentic tasks. However, in off-policy training pipelines, these methods often exhibit unstable optimization dynamics and are prone to performance collapse. Through empirical analysis, we identify two fundamental sources of instability in this setting: (1) a granularity mismatch between token-level policy optimization and turn-structured interactions, and (2) high-variance and unreliable gradient updates induced by off-policy importance sampling and inaccurate advantage estimation. To address these challenges, we propose SORL, Stabilizing Off-Policy Reinforcement Learning for Long-Horizon Agent Training. SORL introduces principled mechanisms that align policy optimization with the structure of multi-turn interactions and adaptively suppress unreliable off-policy updates, yielding more conservative and rob
Related papers
- Turn-PPO: Turn-level Advantage Estimation With PPO For Improved Multi-turn RL In Agentic LLMs (2025)
- It's Not You, It's Clipping: A Soft Trust-region Via Probability Smoothing For LLM RL (2025)
- OBLR-PO: A Theoretical Framework For Stable Reinforcement Learning (2025)
- GHPO: Adaptive Guidance For Stable And Efficient LLM Reinforcement Learning (2025)
- Stabilizing Reinforcement Learning For Diffusion Language Models (2026)
- Stabilizing Policy Gradients For Sample-efficient Reinforcement Learning In LLM Reasoning (2025)
- EPO: Entropy-regularized Policy Optimization For LLM Agents Reinforcement Learning (2025)
- DISPO: Enhancing Training Efficiency And Stability In Reinforcement Learning For Large Language Model Mathematical Reasoning (2026)