Arbitrary Entropy Policy Optimization Breaks The Exploration Bottleneck Of Reinforcement Learning
2025 · Chen Wang, Zhaochun Li, Jionghao Bai, et al.
Abstract
Reinforcement Learning (RL) is essential for enhancing the reasoning capabilities of large language models (LLMs), yet the widely adopted Group Relative Policy Optimization (GRPO) suffers from entropy collapse, causing exploration to vanish and policies to converge prematurely. As a result, RL is widely believed to be incapable of expanding the reasoning frontier of LLMs. Existing entropy-regularized methods introduce an inevitable trade-off between reward and entropy, leading to exploration accompanied by non-negligible optimization bias. In this work, we prove that temperature-guided REINFORCE can modulate policy entropy, and propose Arbitrary Entropy Policy Optimization (AEPO), which reformulates entropy regularization as a policy-gradient optimization problem. Rather than manipulating entropy directly, AEPO implicitly regulates it by applying a REINFORCE regularization term on temperature-adjusted samples, ensuring that entropy is controlled but never dominates optimization, thereb
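The abstract's premise that temperature can modulate entropy can be illustrated on a toy distribution. The sketch below is not AEPO and not the authors' code. It only shows the general fact that dividing logits by a higher temperature before the softmax yields a higher-entropy sampling distribution, and a lower temperature yields a lower-entropy one. The logits and the helper names `softmax` and `entropy` are hypothetical and chosen for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical next-token logits (illustrative values, not from the paper).
logits = [2.0, 1.0, 0.5, -1.0]

# Entropy of the sampling distribution at several temperatures.
entropies = {t: entropy(softmax(logits, t)) for t in (0.5, 1.0, 2.0)}

for t, h in entropies.items():
    print(f"temperature={t}: entropy={h:.4f} nats")
```

Samples drawn at a higher temperature therefore come from a more exploratory distribution. How AEPO uses such temperature-adjusted samples inside its REINFORCE regularization term is described in the paper itself and is not reproduced here.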
Related papers
- EP-GRPO: Entropy-progress Aligned Group Relative Policy Optimization With Implicit Process Guidance (2026)
- EPO: Entropy-regularized Policy Optimization For LLM Agents Reinforcement Learning (2025)
- Agentic Entropy-balanced Policy Optimization (2025)
- AEGPO: Adaptive Entropy-guided Policy Optimization For Diffusion Models (2026)
- Understanding The Impact Of Entropy On Policy Optimization (2018)
- A Comparative Theoretical Analysis Of Entropy Control Methods In Reinforcement Learning (2026)
- Think Outside The Policy: In-context Steered Policy Optimization (2025)
- EnTRPO: Trust Region Policy Optimization Method With Entropy Regularization (2021)