Learn The Ropes, Then Trust The Wins: Self-imitation With Progressive Exploration For Agentic Reinforcement Learning
2025 · Yulei Qin, Xiaoyu Tan, Zhengbao He, et al.
Abstract
Reinforcement learning (RL) is the dominant paradigm for sharpening the strategic tool-use capabilities of LLMs on long-horizon, sparsely rewarded agent tasks, yet it faces the fundamental challenge of the exploration-exploitation trade-off. Existing studies stimulate exploration through the lens of policy entropy, but such mechanical entropy maximization is prone to RL instability caused by multi-turn distribution shift. In this paper, we target a progressive exploration-exploitation balance, guided by the agent's own experiences, that succumbs to neither entropy collapse nor runaway divergence. We propose SPEAR, a self-imitation learning (SIL) recipe for training agentic LLMs. It extends vanilla SIL, in which a replay buffer stores good experiences for off-policy updates, by gradually steering the policy entropy across stages. Specifically, the proposed curriculum scheduling harmonizes intrinsic reward shaping and self-imitation to 1) expedite exploration via frequent tool
Related papers
- First-explore, Then Exploit: Meta-learning To Solve Hard Exploration-exploitation Trade-offs (2023)
- Learning Self-imitating Diverse Policies (2018)
- MULEX: Disentangling Exploitation From Exploration In Deep RL (2019)
- Bootstrapping Task Spaces For Self-improvement (2025)
- Exploitation Is All You Need... For Exploration (2025)
- Learning Off-policy With Model-based Intrinsic Motivation For Active Online Exploration (2024)
- Blending Imitation And Reinforcement Learning For Robust Policy Improvement (2023)
- Towards Agentic Self-learning LLMs In Search Environment (2025)