3SPO: State-Score-Supervised Policy Optimization for LLM Agents

Abstract

Training large language models (LLMs) as autonomous agents via reinforcement learning (RL) has enabled frontier models to achieve superhuman performance in long-horizon tasks. However, existing RL algorithms operate at the trajectory level, performing policy optimization only after collecting complete episode rollouts. This coarse-grained approach faces fundamental challenges in multi-turn agent settings where rewards are sparse, delayed, and credit assignment across individual steps is critical. In this work, we propose State-Score-Supervised Policy Optimization (3SPO), a novel RL algorithm that performs post-step policy optimization with dynamic state score supervision. At each step, 3SPO computes the state score based on historical success rates, supervising step-wise credit assignment, adaptive rollout and post-step policy optimization without requiring value function estimation or additional auxiliary models. Theoretically, under a per-state bandit abstraction, we show that the proposed score-supervised allocation mechanism achieves logarithmic allocation regret and provide sample-complexity guarantees for action identification, score distinguishability, and filtering stability. Experiments on ALFWorld and WebShop with Qwen2.5-1.5B/7B-Instruct show that 3SPO consistently outperforms GRPO by +22.6% on ALFWorld and +15.6 points on WebShop, while using comparable resources to achieve 2.4× more state exploration and 1.8× faster convergence. Code is available at https://github.com/genalyu/3SPO.
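The abstract describes a state score derived from historical success rates that supervises step-wise credit assignment. The following is only an illustrative sketch of that idea under stated assumptions, not the authors' implementation: the class name `StateScoreTable`, the smoothing prior, and the use of a score difference as the step-level signal are all hypothetical choices made here for clarity.

```python
class StateScoreTable:
    """Per-state success statistics (hypothetical helper, not from the paper)."""

    def __init__(self, prior_success=1.0, prior_total=2.0):
        # Assumed Laplace-style prior so that unseen states score 0.5.
        self.prior_success = prior_success
        self.prior_total = prior_total
        self.success = {}  # state key -> successful episodes passing through it
        self.total = {}    # state key -> all episodes passing through it

    def update(self, visited_states, succeeded):
        # Count each distinct state on the trajectory once per episode.
        for state in set(visited_states):
            self.total[state] = self.total.get(state, 0.0) + 1.0
            if succeeded:
                self.success[state] = self.success.get(state, 0.0) + 1.0

    def score(self, state):
        # Smoothed historical success rate of episodes that visited `state`.
        wins = self.success.get(state, 0.0) + self.prior_success
        visits = self.total.get(state, 0.0) + self.prior_total
        return wins / visits

    def step_credit(self, state, next_state):
        # Assumed step-level signal: change in score across one transition.
        return self.score(next_state) - self.score(state)


table = StateScoreTable()
table.update(["start", "found_key"], succeeded=True)
table.update(["start", "dead_end"], succeeded=False)
good_step = table.step_credit("start", "found_key")  # positive
bad_step = table.step_credit("start", "dead_end")    # negative
```

In this sketch a step that moves the agent into a historically more successful state receives positive credit, without any learned value function. How 3SPO actually defines state keys, scores, and the resulting optimization signal is specified in the paper and repository, not here.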
