Turn-PPO: Turn-level Advantage Estimation With PPO For Improved Multi-turn RL In Agentic LLMs
2025 · Junbo Li, Peng Zhou, Rui Meng, et al.
Abstract
Reinforcement learning (RL) has re-emerged as a natural approach for training interactive LLM agents in real-world environments. However, directly applying the widely used Group Relative Policy Optimization (GRPO) algorithm to multi-turn tasks exposes notable limitations, particularly in scenarios requiring long-horizon reasoning. To address these challenges, we investigate more stable and effective advantage estimation strategies, especially for multi-turn settings. We first explore Proximal Policy Optimization (PPO) as an alternative and find it to be more robust than GRPO. To further enhance PPO in multi-turn scenarios, we introduce turn-PPO, a variant that operates on a turn-level MDP formulation, as opposed to the commonly used token-level MDP. Our results on the WebShop and Sokoban datasets demonstrate the effectiveness of turn-PPO, both with and without long reasoning components.
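The abstract contrasts a turn-level MDP with the usual token-level one but does not spell out the estimator. As a rough illustration only, the sketch below applies standard generalized advantage estimation (GAE) with one step per turn rather than one step per token. The function names, the per-turn reward and value inputs, the choice of GAE, and the step that copies each turn's advantage onto its tokens are all assumptions made for this example, not details taken from the paper.

```python
from typing import List


def turn_level_gae(
    rewards: List[float],
    values: List[float],
    gamma: float = 0.99,
    lam: float = 0.95,
) -> List[float]:
    """Standard GAE where each time step is one full agent turn.

    rewards[t] is the reward received after turn t and values[t] is a
    critic estimate for the state at the start of turn t (both are
    hypothetical inputs). The episode is assumed to end after the last turn.
    """
    if len(rewards) != len(values):
        raise ValueError("rewards and values must have the same length")

    advantages = [0.0] * len(rewards)
    next_value = 0.0  # terminal state after the final turn has value 0
    running = 0.0     # discounted sum of TD errors from later turns
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # turn-level TD error
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


def broadcast_to_tokens(
    advantages: List[float], tokens_per_turn: List[int]
) -> List[float]:
    """Assumed credit assignment: every token in a turn shares that turn's advantage."""
    per_token: List[float] = []
    for adv, n_tokens in zip(advantages, tokens_per_turn):
        per_token.extend([adv] * n_tokens)
    return per_token


if __name__ == "__main__":
    # Two-turn episode with a reward only at the end.
    adv = turn_level_gae([0.0, 1.0], [0.5, 0.5], gamma=1.0, lam=1.0)
    print(adv)
    print(broadcast_to_tokens(adv, [3, 2]))
```

In this sketch the discount and the lambda decay are applied once per turn, so credit does not shrink with the number of tokens an agent emits inside a turn, which is one plausible motivation for a turn-level formulation in long-horizon tasks.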
Related papers
- Stabilizing Off-policy Training For Long-horizon LLM Agent Via Turn-level Importance Sampling And Clipping-triggered Normalization (2025)
- Direct Multi-turn Preference Optimization For Language Agents (2024)
- What's Behind PPO's Collapse In Long-CoT? Value Optimization Holds The Secret (2025)
- A Theoretical Analysis Of Optimistic Proximal Policy Optimization In Linear Markov Decision Processes (2023)
- Truly Proximal Policy Optimization (2019)
- Think Outside The Policy: In-context Steered Policy Optimization (2025)
- AM-PPO: (Advantage) Alpha-modulation With Proximal Policy Optimization (2025)
- KIPPO: Koopman-inspired Proximal Policy Optimization (2025)