PhGPO: Pheromone-guided Policy Optimization For Long-horizon Tool Planning
2026 · Yu Li, Guangfeng Cai, Shengtian Yang, et al.
Abstract
Recent advances in Large Language Model (LLM) agents have demonstrated strong capabilities in executing complex tasks through tool use. However, long-horizon multi-step tool planning remains challenging because the exploration space suffers from a combinatorial explosion. In this setting, even when a correct tool-use path is found, it is usually treated only as an immediate reward for the current training, providing no reusable information for subsequent training. In this paper, we argue that historically successful trajectories contain reusable tool-transition patterns, which can be leveraged throughout the whole training process. Inspired by ant colony optimization, where historically successful paths are reflected in the pheromone, we propose Pheromone-Guided Policy Optimization (PhGPO), which learns a trajectory-based transition pattern (i.e., pheromone) from historical trajectories and then uses the learned pheromone to guide policy optimization. This learned pheromone pr
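To make the ant-colony analogy concrete, the following is a minimal sketch of a classic tabular pheromone update over tool-to-tool transitions. It is not the PhGPO method: the abstract says PhGPO learns its pheromone from historical trajectories, and the details are not given here. The class name `PheromoneTable`, the `evaporation`, `deposit`, and `smoothing` parameters, and the tool names are all hypothetical, chosen only for illustration.

```python
class PheromoneTable:
    """Textbook ACO-style pheromone over (tool_a -> tool_b) transitions.

    Illustrative assumption only; PhGPO learns its pheromone rather than
    using this fixed evaporate-and-deposit rule.
    """

    def __init__(self, evaporation=0.1, deposit=1.0, smoothing=1.0):
        self.evaporation = evaporation  # fraction of pheromone lost per update
        self.deposit = deposit          # amount added per successful transition
        self.smoothing = smoothing      # keeps unseen transitions explorable
        self.tau = {}                   # (tool_a, tool_b) -> pheromone level

    def update(self, successful_trajectories):
        # Evaporate first, so old evidence decays over time.
        for edge in self.tau:
            self.tau[edge] *= 1.0 - self.evaporation
        # Then reinforce every transition seen in a successful trajectory.
        for trajectory in successful_trajectories:
            for edge in zip(trajectory, trajectory[1:]):
                self.tau[edge] = self.tau.get(edge, 0.0) + self.deposit

    def prior(self, current_tool, candidates):
        # Normalize smoothed pheromone into a distribution over next tools.
        weights = [
            self.tau.get((current_tool, c), 0.0) + self.smoothing
            for c in candidates
        ]
        total = sum(weights)
        return {c: w / total for c, w in zip(candidates, weights)}


# Hypothetical tool names, for illustration only.
table = PheromoneTable(evaporation=0.5)
table.update([
    ["search", "fetch", "summarize"],
    ["search", "fetch", "translate"],
])
next_tool_prior = table.prior("search", ["fetch", "summarize"])
```

In this sketch, transitions that recur across successful trajectories accumulate pheromone and receive a larger share of the prior, while the smoothing term leaves unseen transitions reachable. One possible use, again an assumption rather than the paper's design, would be to add the log of this prior to a policy's next-tool logits during sampling.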
Related papers
- TGPO: Temporal Grounded Policy Optimization For Signal Temporal Logic Tasks (2025)
- End-to-end Optimization Of LLM-driven Multi-agent Search Systems Via Heterogeneous-group-based Reinforcement Learning (2025)
- GHPO: Adaptive Guidance For Stable And Efficient LLM Reinforcement Learning (2025)
- Policy Optimization With Model-based Explorations (2018)
- Pairwise Proximal Policy Optimization: Harnessing Relative Feedback For LLM Alignment (2023)
- Think Outside The Policy: In-context Steered Policy Optimization (2025)
- Improved Exploration Through Latent Trajectory Optimization In Deep Deterministic Policy Gradient (2019)
- PTR-PPO: Proximal Policy Optimization With Prioritized Trajectory Replay (2021)