What's Behind PPO's Collapse In Long-CoT? Value Optimization Holds The Secret
2025 · Yufeng Yuan, Yu Yue, Ruofei Zhu, et al.
Abstract
Reinforcement learning (RL) is pivotal for enabling large language models (LLMs) to generate long chains of thought (CoT) for complex tasks like math and reasoning. However, Proximal Policy Optimization (PPO), effective in many RL scenarios, fails in long CoT tasks. This paper identifies that value initialization bias and reward signal decay are the root causes of PPO's failure. We propose Value-Calibrated PPO (VC-PPO) to address these issues. In VC-PPO, the value model is pretrained to tackle initialization bias, and the Generalized Advantage Estimation (GAE) computation is decoupled between the actor and critic to mitigate reward signal decay. Experiments on the American Invitational Mathematics Examination (AIME) show that VC-PPO significantly boosts PPO performance. Ablation studies show that techniques in VC-PPO are essential in enhancing PPO for long CoT tasks.
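To make the reward signal decay mentioned in the abstract concrete, here is a minimal Python sketch of GAE computed with separate λ values for the actor and the critic. The function name `gae`, the trajectory length, the zero-valued critic, and the specific λ values (0.95 for the actor, 1.0 for the critic) are illustrative assumptions and are not taken from the abstract. With a single reward at the end of a long response and an uninformative value model, the advantage at the first token scales as λ^(T-1), so it nearly vanishes for λ < 1 and is preserved for λ = 1.

```python
def gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation over one finished trajectory.

    `values` holds one estimate per step; the value after the final
    step is treated as 0 because the episode has ended.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD error
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Hypothetical long-CoT trajectory: 1000 tokens, reward only at the end.
T = 1000
rewards = [0.0] * (T - 1) + [1.0]
values = [0.0] * T  # assumed uninformative critic, for illustration only

# Decoupled computation: each consumer of GAE gets its own lambda.
adv_actor = gae(rewards, values, lam=0.95)   # assumed actor setting
adv_critic = gae(rewards, values, lam=1.0)   # assumed critic setting

# Critic regression targets are advantage plus the current value estimate.
critic_targets = [a + v for a, v in zip(adv_critic, values)]

print(f"first-token advantage, lam=0.95: {adv_actor[0]:.3e}")
print(f"first-token critic target, lam=1.0: {critic_targets[0]:.3f}")
```

In this toy setting the critic target at the first token still reflects the final reward, while the λ = 0.95 estimate at the same position is effectively zero. That is the kind of decay a decoupled λ is meant to avoid on the value side.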
Related papers
- Truly Proximal Policy Optimization (2019)
- Turn-PPO: Turn-level Advantage Estimation With PPO For Improved Multi-turn RL In Agentic LLMs (2025)
- KIPPO: Koopman-inspired Proximal Policy Optimization (2025)
- Pairwise Proximal Policy Optimization: Harnessing Relative Feedback For LLM Alignment (2023)
- No Representation, No Trust: Connecting Representation, Collapse, And Trust Issues In PPO (2024)
- Revisiting Design Choices In Proximal Policy Optimization (2020)
- Policy Optimization With Model-based Explorations (2018)
- DISPO: Enhancing Training Efficiency And Stability In Reinforcement Learning For Large Language Model Mathematical Reasoning (2026)