Sharp Analysis For KL-Regularized Contextual Bandits And RLHF
2024 · Heyang Zhao, Chenlu Ye, Quanquan Gu, et al.
Abstract
Reverse Kullback-Leibler (KL) regularization has emerged as a predominant technique for enhancing policy optimization in reinforcement learning (RL) and reinforcement learning from human feedback (RLHF); it forces the learned policy to stay close to a reference policy. While the effectiveness and necessity of KL-regularization have been empirically demonstrated in various practical scenarios, current theoretical analysis of KL-regularized RLHF still obtains the same \(\mathcal{O}(1 / \epsilon^2)\) sample complexity as problems without KL-regularization. To understand the fundamental distinction between policy learning objectives with KL-regularization and ones without KL-regularization, we are the first to theoretically demonstrate the power of KL-regularization by providing a sharp analysis for KL-regularized contextual bandits and RLHF, revealing an \(\mathcal{O}(1 / \epsilon)\) sample complexity when \(\epsilon\) is sufficiently small. We further explore the role of da
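To make the objective discussed in the abstract concrete, here is a minimal sketch of a KL-regularized objective for a single context with a finite action set. It is an illustration under stated assumptions, not the paper's algorithm: the objective is written as expected reward minus (1/eta) times the KL divergence to the reference policy, and the names `kl_regularized_value`, `gibbs_policy`, `eta`, and the toy numbers are all hypothetical. The closed-form maximizer used here (reference policy reweighted by `exp(eta * reward)`) is the standard solution of this kind of objective.

```python
import math

def kl_regularized_value(policy, ref_policy, rewards, eta):
    """Expected reward minus (1/eta) * KL(policy || ref_policy) for one context."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    # Terms with p == 0 contribute nothing to the KL divergence.
    kl = sum(p * math.log(p / q) for p, q in zip(policy, ref_policy) if p > 0)
    return expected_reward - kl / eta

def gibbs_policy(ref_policy, rewards, eta):
    """Maximizer of the objective above: pi(a) proportional to ref(a) * exp(eta * r(a))."""
    max_r = max(rewards)  # subtract the max reward for numerical stability
    weights = [q * math.exp(eta * (r - max_r)) for q, r in zip(ref_policy, rewards)]
    total = sum(weights)
    return [w / total for w in weights]

# Toy example (assumed values): three actions, known rewards.
ref_policy = [0.5, 0.3, 0.2]
rewards = [0.1, 0.9, 0.4]
eta = 2.0

best = gibbs_policy(ref_policy, rewards, eta)
best_value = kl_regularized_value(best, ref_policy, rewards, eta)
ref_value = kl_regularized_value(ref_policy, ref_policy, rewards, eta)
```

In this sketch a larger `eta` weakens the regularization and lets `best` concentrate on the highest-reward action, while a smaller `eta` keeps it close to `ref_policy`.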
Related papers
- KL-Regularization Itself Is Differentially Private In Bandits And RLHF (2025)
- Rethinking KL Regularization In RLHF: From Value Estimation To Gradient Optimization (2025)
- Leverage The Average: An Analysis Of KL Regularization In RL (2020)
- Information Asymmetry In KL-Regularized RL (2019)
- Regularization Matters In Policy Optimization (2019)
- Principled Reinforcement Learning With Human Feedback From Pairwise Or \(k\)-wise Comparisons (2023)
- Can RLHF Be More Efficient With Imperfect Reward Models? A Policy Coverage Perspective (2025)
- A KL-Regularization Framework For Learning To Plan With Adaptive Priors (2025)