Explaining And Preventing Alignment Collapse In Iterative RLHF
2026 Β· Etienne Gauthier, Francis Bach, Michael I. Jordan
Abstract
arXiv:2605.04266v1 Announce Type: new Abstract: Reinforcement learning from human feedback (RLHF) typically assumes a static or non-strategic reward model (RM). In iterative deployment, however, the policy generates the data on which the RM is retrained, creating a feedback loop. Building on the Stackelberg game formulation of this interaction, we derive an analytical decomposition of the policy's true optimization gradient into a standard policy gradient and a parameter-steering term that captures the policy's influence on the RM's future parameters. We show that standard iterative RLHF, which drops this steering term entirely, suffers from alignment collapse: the policy systematically exploits the RM's blind spots, producing low-quality, high-reward outputs whose feedback reinforces the very errors it exploits. To mitigate this, we propose foresighted policy optimization (FPO), a mechanism-design intervention that restores the missing steering term by regularizing the policy's param
Authors
(none)
Tags
Stats
Related papers
- Exploration-driven Policy Optimization In RLHF: Theoretical Insights On Efficient Data Utilization (2024)0.00
- Provably Mitigating Overoptimization In RLHF: Your SFT Loss Is Implicitly An Adversarial Regularizer (2024)0.00
- The Alignment Ceiling: Objective Mismatch In Reinforcement Learning From Human Feedback (2023)0.00
- SAFE: Stable Alignment Finetuning With Entropy-aware Predictive Control For Reinforcement Learning From Human Feedback (RLHF) (2026)0.00
- Rethinking KL Regularization In RLHF: From Value Estimation To Gradient Optimization (2025)0.00
- Data-dependent Exploration For Online Reinforcement Learning From Human Feedback (2026)0.00
- Wasserstein Distributionally Robust Regret Optimization For Reinforcement Learning From Human Feedback (2026)0.00
- Principled Reinforcement Learning With Human Feedback From Pairwise Or \(k\)-wise Comparisons (2023)0.00