SAFE: Stable Alignment Finetuning With Entropy-aware Predictive Control For Reinforcement Learning From Human Feedback (RLHF)
2026 · Dipan Maity
Abstract
Proximal Policy Optimization (PPO) has been positioned by recent literature as the canonical method for the RL part of Reinforcement Learning from Human Feedback (RLHF). PPO performs well empirically, but its motivation is heuristic and it handles the KL-divergence constraint used in LM-RLHF in an ad hoc manner. It also suffers from reward oscillations, entropy collapse, value function drift, and sudden policy divergence, which require frequent restarts and extensive hyperparameter tuning. In this paper, we develop a new, purely on-policy actor-critic RL method for the LM-RLHF setting. We present SAFE (Stable Alignment Finetuning with Entropy-aware control), a novel RLHF algorithm that pairs a Double Soft-Min Critic for pessimistic value estimation with a new multi-layer stabilization framework that combines entropy-gated KL regulation and PID-controlled adaptive thresholds. Unlike standard PPO's symmetric KL penalties, SAFE distinguishes high-entropy exploration from low-entropy mode collapse and adj
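The three ingredients named in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the log-sum-exp form of the soft-min, the step-function entropy gate, the textbook PID update, and every name and default (`soft_min`, `entropy_gated_kl_coef`, `PIDController`, `temperature`, `entropy_floor`, `low_entropy_gain`) are assumptions made for illustration only.

```python
import math


def soft_min(v1, v2, temperature=1.0):
    """Smooth pessimistic combination of two critic estimates.

    Assumed form: -T * log(exp(-v1/T) + exp(-v2/T)), computed stably.
    The result never exceeds min(v1, v2) and approaches it as T -> 0.
    """
    m = min(v1, v2)
    return m - temperature * math.log(
        math.exp(-(v1 - m) / temperature) + math.exp(-(v2 - m) / temperature)
    )


def entropy_gated_kl_coef(base_coef, entropy, entropy_floor, low_entropy_gain=2.0):
    """Hypothetical gate: scale the KL penalty up only when entropy is low,
    so high-entropy exploration is penalised less than mode collapse."""
    if entropy < entropy_floor:
        return base_coef * low_entropy_gain
    return base_coef


class PIDController:
    """Textbook discrete PID loop; here it could track a target KL value."""

    def __init__(self, kp, ki, kd, setpoint):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, measurement):
        # Positive output means the measurement is above the setpoint.
        error = measurement - self.setpoint
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative
```

In such a sketch, the PID output would be added to a KL threshold or penalty coefficient after each update, and `soft_min` would replace a single critic's value when computing advantages. How SAFE actually wires these pieces together is not stated in the text above.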
Related papers
- Exploration-driven Policy Optimization In RLHF: Theoretical Insights On Efficient Data Utilization (2024)
- Pairwise Proximal Policy Optimization: Harnessing Relative Feedback For LLM Alignment (2023)
- Explaining And Preventing Alignment Collapse In Iterative RLHF (2026)
- Policy Filtration For RLHF To Mitigate Noise In Reward Models (2024)
- Model-based Safe Deep Reinforcement Learning Via A Constrained Proximal Policy Optimization Algorithm (2022)
- Feasible Policy Iteration For Safe Reinforcement Learning (2023)
- Safe Policy Optimization With Local Generalized Linear Function Approximations (2021)
- Halypo: Heterogeneous-agent Lyapunov Policy Optimization For Human-robot Collaboration (2026)