Stabilizing Off-policy Training For Long-horizon LLM Agent Via Turn-level Importance Sampling And Clipping-triggered Normalization

Abstract

Reinforcement learning (RL) algorithms such as PPO and GRPO are widely used to train large language models (LLMs) for multi-turn agentic tasks. However, in off-policy training pipelines, these methods often exhibit unstable optimization dynamics and are prone to performance collapse. Through empirical analysis, we identify two fundamental sources of instability in this setting: (1)~a granularity mismatch between token-level policy optimization and turn-structured interactions, and (2) high-variance and unreliable gradient updates induced by off-policy importance sampling and inaccurate advantage estimation. To address these challenges, we propose SORL, \underline\{S\}tabilizing \underline\{O\}ff-Policy \underline\{R\}einforcement \underline\{L\}earning for Long-Horizon Agent Training. SORL introduces principled mechanisms that align policy optimization with the structure of multi-turn interactions and adaptively suppress unreliable off-policy updates, yielding more conservative and rob

Stabilizing Off-policy Training For Long-horizon LLM Agent Via Turn-level Importance Sampling And Clipping-triggered Normalization

Abstract

Authors

Tags

Stats

Related papers