GHPO: Adaptive Guidance For Stable And Efficient LLM Reinforcement Learning
2025 Β· Ziru Liu, Cheng Gong, Xinyu Fu, et al.
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a powerful paradigm for facilitating the self-improvement of large language models (LLMs), particularly in the domain of complex reasoning tasks. However, prevailing on-policy RL methods often contend with significant training instability and inefficiency. This is primarily due to a capacity-difficulty mismatch, where the complexity of training data frequently outpaces the model's current capabilities, leading to critically sparse reward signals and stalled learning progress. This challenge is particularly acute for smaller, more resource-efficient LLMs. To overcome this, we introduce the Guided Hybrid Policy Optimization (GHPO), a novel difficulty-aware reinforcement learning framework. GHPO dynamically calibrates task difficulty by employing adaptive prompt refinement to provide targeted guidance. This unique approach adaptively balances direct imitation learning for problems currently beyond the model's re
Authors
(none)
Tags
Stats
Related papers
- Guided Policy Optimization Under Partial Observability (2025)0.00
- Pairwise Proximal Policy Optimization: Harnessing Relative Feedback For LLM Alignment (2023)0.00
- Stepwise Guided Policy Optimization: Coloring Your Incorrect Reasoning In GRPO (2025)0.00
- Think Outside The Policy: In-context Steered Policy Optimization (2025)0.00
- Adapt To Thrive! Adaptive Power-mean Policy Optimization For Improved LLM Reasoning (2026)0.00
- EP-GRPO: Entropy-progress Aligned Group Relative Policy Optimization With Implicit Process Guidance (2026)0.00
- Remax: A Simple, Effective, And Efficient Reinforcement Learning Method For Aligning Large Language Models (2023)0.00
- OBLR-PO: A Theoretical Framework For Stable Reinforcement Learning (2025)0.00