Sharp Analysis for KL-Regularized Contextual Bandits and RLHF

Abstract

Reverse Kullback-Leibler (KL) regularization, which forces the learned policy to stay close to a reference policy, has emerged as a predominant technique for enhancing policy optimization in reinforcement learning (RL) and reinforcement learning from human feedback (RLHF). While the effectiveness and necessity of KL-regularization have been demonstrated empirically in various practical scenarios, current theoretical analyses of KL-regularized RLHF still obtain the same \(\mathcal{O}(1 / \epsilon^2)\) sample complexity as problems without KL-regularization. To understand the fundamental distinction between policy learning objectives with and without KL-regularization, we are the first to theoretically demonstrate the power of KL-regularization by providing a sharp analysis for KL-regularized contextual bandits and RLHF, revealing an \(\mathcal{O}(1 / \epsilon)\) sample complexity when \(\epsilon\) is sufficiently small. We further explore the role of da
