EPO: Entropy-regularized Policy Optimization For LLM Agents Reinforcement Learning
2025 · Wujiang Xu, Wentian Zhao, Zhenting Wang, et al.
Abstract
Training LLM agents in multi-turn environments with sparse rewards, where completing a single task requires 30+ turns of interaction within an episode, presents a fundamental challenge for reinforcement learning. We identify a critical failure mode unique to this setting: the exploration-exploitation cascade failure. This cascade begins with early-stage policy premature convergence, where sparse feedback causes agents to commit to flawed, low-entropy strategies. Subsequently, agents enter late-stage policy collapse, where conventional entropy regularization becomes counterproductive, promoting chaotic exploration that destabilizes training. We propose Entropy-regularized Policy Optimization (EPO), a general framework that breaks this failure cycle through three synergistic mechanisms: (1) adopting entropy regularization in multi-turn settings to enhance exploration, (2) an entropy smoothing regularizer that bounds policy entropy within historical averages to prevent abrupt fluctuations
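As a rough illustration of the first two mechanisms named in the abstract, the sketch below adds an entropy bonus to a policy loss and penalizes entropy that leaves a band around its running historical average. The abstract gives no formulas, so everything here is an assumption made for illustration: the `EntropySmoother` class, the band multipliers `low` and `high`, the window size, and the coefficients `beta` and `alpha` are hypothetical and are not taken from the paper. The third mechanism is not described in the text above and is not modeled.

```python
import math
from collections import deque


def entropy(probs):
    """Shannon entropy (in nats) of one categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


class EntropySmoother:
    """Hypothetical helper: penalize entropy outside a band around its running mean."""

    def __init__(self, window=100, low=0.8, high=1.2):
        # window, low, and high are illustrative values, not from the paper.
        self.history = deque(maxlen=window)
        self.low = low
        self.high = high

    def penalty(self, current):
        """Return the distance of `current` from the band [low*mean, high*mean]."""
        if not self.history:
            # No history yet, so there is nothing to compare against.
            self.history.append(current)
            return 0.0
        mean = sum(self.history) / len(self.history)
        lower, upper = self.low * mean, self.high * mean
        self.history.append(current)
        if current < lower:
            return lower - current
        if current > upper:
            return current - upper
        return 0.0


def regularized_loss(policy_loss, step_entropy, smoother, beta=0.01, alpha=0.1):
    """Policy loss minus an entropy bonus plus the smoothing penalty (assumed form)."""
    return policy_loss - beta * step_entropy + alpha * smoother.penalty(step_entropy)


if __name__ == "__main__":
    smoother = EntropySmoother(window=3)
    for h in (1.0, 1.0, 2.0, 0.5):
        print(h, round(smoother.penalty(h), 4))
```

In this sketch the entropy bonus encourages exploration, while the penalty grows only when entropy jumps above or drops below the band, which is one simple way to discourage abrupt fluctuations. A real training loop would compute `step_entropy` from the policy's token distributions, averaged over the turns of an episode.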
Related papers
- Arbitrary Entropy Policy Optimization Breaks The Exploration Bottleneck Of Reinforcement Learning (2025)
- Agentic Entropy-balanced Policy Optimization (2025)
- Stabilizing Off-policy Training For Long-horizon LLM Agent Via Turn-level Importance Sampling And Clipping-triggered Normalization (2025)
- Examining Policy Entropy Of Reinforcement Learning Agents For Personalization Tasks (2022)
- Optimal Scheduling Of Entropy Regulariser For Continuous-time Linear-quadratic Reinforcement Learning (2022)
- EP-GRPO: Entropy-progress Aligned Group Relative Policy Optimization With Implicit Process Guidance (2026)
- Understanding The Impact Of Entropy On Policy Optimization (2018)
- An Entropy Regularization Free Mechanism For Policy-based Reinforcement Learning (2021)