Principled Reinforcement Learning With Human Feedback From Pairwise Or \(k\)-wise Comparisons
2023 · Banghua Zhu, Jiantao Jiao, Michael I. Jordan
Abstract
We provide a theoretical framework for Reinforcement Learning with Human Feedback (RLHF). Our analysis shows that when the true reward function is linear, the widely used maximum likelihood estimator (MLE) converges under both the Bradley-Terry-Luce (BTL) model and the Plackett-Luce (PL) model. However, we show that when training a policy based on the learned reward model, MLE fails while a pessimistic MLE provides policies with improved performance under certain coverage assumptions. Additionally, we demonstrate that under the PL model, the true MLE and an alternative MLE that splits the \(K\)-wise comparison into pairwise comparisons both converge. Moreover, the true MLE is asymptotically more efficient. Our results validate the empirical success of existing RLHF algorithms in InstructGPT and provide new insights for algorithm design. Furthermore, our results unify the problem of RLHF and max-entropy Inverse Reinforcement Learning (IRL), and provide the first sample complexity bound
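The abstract's starting point, maximum likelihood estimation of a linear reward under the BTL model, can be illustrated with a minimal sketch. It assumes a reward of the form r(x) = theta · phi(x), so that each pairwise comparison is summarized by a feature difference and the MLE reduces to logistic regression. The function names, the synthetic data, and the plain gradient-descent solver are all assumptions made for illustration, not the authors' implementation; the pessimistic variant and any norm constraint on theta are omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def btl_negative_log_likelihood(theta, diffs, prefs):
    """Average negative log-likelihood of pairwise preferences under BTL.

    diffs[i] = phi(option_1) - phi(option_0) for comparison i (hypothetical features).
    prefs[i] = 1.0 if option_1 was preferred, else 0.0.
    """
    logits = diffs @ theta
    signed = np.where(prefs == 1, logits, -logits)
    # -log(sigmoid(z)) = log(1 + exp(-z)), computed stably with logaddexp
    return np.mean(np.logaddexp(0.0, -signed))

def fit_btl_mle(diffs, prefs, lr=0.5, steps=2000):
    """Gradient descent on the BTL negative log-likelihood (illustrative solver)."""
    theta = np.zeros(diffs.shape[1])
    for _ in range(steps):
        probs = sigmoid(diffs @ theta)
        grad = diffs.T @ (probs - prefs) / len(prefs)
        theta -= lr * grad
    return theta

# Synthetic comparisons drawn from a known linear reward (assumed for the demo).
rng = np.random.default_rng(0)
theta_true = np.array([1.0, -0.5, 0.25])
diffs = rng.normal(size=(5000, 3))
prefs = (rng.random(5000) < sigmoid(diffs @ theta_true)).astype(float)

theta_hat = fit_btl_mle(diffs, prefs)
print("estimated theta:", theta_hat)
```

With enough well-spread comparisons, the estimate lands near `theta_true`, which matches the convergence claim in spirit; it says nothing about the quality of a policy trained on the learned reward, which is where the abstract argues pessimism is needed.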
Related papers
- A Survey Of Reinforcement Learning From Human Feedback (2023)
- The Alignment Ceiling: Objective Mismatch In Reinforcement Learning From Human Feedback (2023)
- Online Iterative Reinforcement Learning From Human Feedback With General Preference Model (2024)
- Exploration-driven Policy Optimization In RLHF: Theoretical Insights On Efficient Data Utilization (2024)
- Rethinking KL Regularization In RLHF: From Value Estimation To Gradient Optimization (2025)
- Reinforcement Learning From Human Feedback Without Reward Inference: Model-free Algorithm And Instance-dependent Analysis (2024)
- Reinforcement Learning With Human Feedback: Learning Dynamic Choices Via Pessimism (2023)
- Humans Are Not Boltzmann Distributions: Challenges And Opportunities For Modelling Human Feedback And Interaction In Reinforcement Learning (2022)