Towards Agentic Self-Learning LLMs in Search Environment
2025 · Wangtao Sun, Xiang Cheng, Jialin Fan, et al.
Abstract
We study whether self-learning can scale LLM-based agents without relying on human-curated datasets or predefined rule-based rewards. Through controlled experiments in a search-agent setting, we identify two key determinants of scalable agent training: the source of reward signals and the scale of agent task data. We find that rewards from a Generative Reward Model (GRM) outperform rigid rule-based signals for open-domain learning, and that co-evolving the GRM with the policy further boosts performance. Increasing the volume of agent task data, even when synthetically generated, substantially enhances agentic capabilities. Building on these insights, we propose Agentic Self-Learning (ASL), a fully closed-loop, multi-role reinforcement learning framework that unifies task generation, policy execution, and evaluation within a shared tool environment and LLM backbone. ASL coordinates a Prompt Generator, a Policy Model, and a Generative Reward Model to form a virtuous cycle of har
Related papers
- The Landscape Of Agentic Reinforcement Learning For LLMs: A Survey (2025)
- AgentEvolver: Towards Efficient Self-evolving Agent System (2025)
- LIGS: Learnable Intrinsic-reward Generation Selection For Multi-agent Learning (2021)
- End-to-end Optimization Of LLM-driven Multi-agent Search Systems Via Heterogeneous-group-based Reinforcement Learning (2025)
- CoMAS: Co-evolving Multi-agent Systems Via Interaction Rewards (2025)
- SAC-GLAM: Improving Online RL For LLM Agents With Soft Actor-critic And Hindsight Relabeling (2024)
- Agent-Pro: Learning To Evolve Via Policy-level Reflection And Optimization (2024)
- Environment Scaling For Interactive Agentic Experience Collection: A Survey (2025)