Wasserstein Distributionally Robust Regret Optimization For Reinforcement Learning From Human Feedback
2026 · Yikai Wang, Shang Liu, Jose Blanchet
Abstract
arXiv:2605.00155v1
Reinforcement learning from human feedback (RLHF) has become a core post-training step for aligning large language models, yet the reward signal used in RLHF is only a learned proxy for true human utility. From an operations research perspective, this creates a decision problem under objective misspecification: the policy is optimized against an estimated reward, while deployment performance is determined by an unobserved objective. The resulting gap leads to reward over-optimization, or Goodharting, where proxy reward continues to improve even after true quality deteriorates. Existing mitigations address this problem through uncertainty penalties, pessimistic rewards, or conservative constraints, but they can be computationally burdensome and overly pessimistic. We propose Wasserstein distributionally robust regret optimization (DRRO) for RLHF. Instead of pessimizing worst-case value as in standard DRO, DRRO pessimizes worst-case regret
Related papers
- Provably Mitigating Overoptimization In RLHF: Your SFT Loss Is Implicitly An Adversarial Regularizer (2024)
- Data-dependent Exploration For Online Reinforcement Learning From Human Feedback (2026)
- Distributional Robustness And Regularization In Reinforcement Learning (2020)
- Exploration-driven Policy Optimization In RLHF: Theoretical Insights On Efficient Data Utilization (2024)
- Reinforcement Learning With Human Feedback: Learning Dynamic Choices Via Pessimism (2023)
- The Perils Of Optimizing Learned Reward Functions: Low Training Error Does Not Guarantee Low Regret (2024)
- Principled Reinforcement Learning With Human Feedback From Pairwise Or \(k\)-wise Comparisons (2023)
- Explaining And Preventing Alignment Collapse In Iterative RLHF (2026)