Co-evolution Of Policy And Internal Reward For Language Agents
2026 · Xinyu Wang, Hanwei Wu, Jingwei Song, et al.
Abstract
Large language model (LLM) agents learn by interacting with environments, but long-horizon training remains fundamentally bottlenecked by sparse and delayed rewards. Existing methods typically address this challenge through post-hoc credit assignment or external reward models, which provide limited guidance at inference time and often separate reward improvement from policy improvement. We propose Self-Guide, a self-generated internal reward for language agents that supports both inference-time guidance and training-time supervision. Specifically, the agent uses Self-Guide as a short self-guidance signal to steer the next action during inference, and converts the same signal into step-level internal reward for denser policy optimization during training. This creates a co-evolving loop: better policy produces better guidance, and better guidance further improves policy as internal reward. Across three agent benchmarks, inference-time self-guidance already yields clear gains, while joint
Related papers
- CoMAS: Co-evolving Multi-agent Systems Via Interaction Rewards (2025)
- From Laws To Motivation: Guiding Exploration Through Law-based Reasoning And Rewards (2024)
- GHPO: Adaptive Guidance For Stable And Efficient LLM Reinforcement Learning (2025)
- Towards Agentic Self-learning LLMs In Search Environment (2025)
- LLM-Explorer: A Plug-in Reinforcement Learning Policy Exploration Enhancement Driven By Large Language Models (2025)
- AgentEvolver: Towards Efficient Self-evolving Agent System (2025)
- RLZero: Direct Policy Inference From Language Without In-domain Supervision (2024)
- Guiding Reinforcement Learning Using Uncertainty-aware Large Language Models (2024)