Reward Model Learning Vs. Direct Policy Optimization: A Comparative Analysis Of Learning From Human Preferences
2024 · Andi Nika, Debmalya Mandal, Parameswaran Kamalaruban, et al.
Abstract
In this paper, we take a step towards a deeper understanding of learning from human preferences by systematically comparing the paradigm of reinforcement learning from human feedback (RLHF) with the recently proposed paradigm of direct preference optimization (DPO). We focus our attention on the class of loglinear policy parametrization and linear reward functions. In order to compare the two paradigms, we first derive minimax statistical bounds on the suboptimality gap induced by both RLHF and DPO, assuming access to an oracle that exactly solves the optimization problems. We provide a detailed discussion on the relative comparison between the two paradigms, simultaneously taking into account the sample size, policy and reward class dimensions, and the regularization temperature. Moreover, we extend our analysis to the approximate optimization setting and derive exponentially decaying convergence rates for both RLHF and DPO. Next, we analyze the setting where the ground-truth reward i
Related papers
- Understanding The Performance Gap In Preference Learning: A Dichotomy Of RLHF And DPO (2025)
- A General Theoretical Paradigm To Understand Learning From Human Preferences (2023)
- Zeroth-order Policy Gradient For Reinforcement Learning From Human Feedback Without Reward Inference (2024)
- Exploration-driven Policy Optimization In RLHF: Theoretical Insights On Efficient Data Utilization (2024)
- Policy Filtration For RLHF To Mitigate Noise In Reward Models (2024)
- Reveal The Mystery Of DPO: The Connection Between DPO And RL Algorithms (2025)
- Provably Efficient Exploration In Policy Optimization (2019)
- Multi-objective Reward And Preference Optimization: Theory And Algorithms (2025)