Exploration-driven Policy Optimization In RLHF: Theoretical Insights On Efficient Data Utilization
2024 Β· Yihan Du, Anna Winnicki, Gal Dalal, et al.
Abstract
Reinforcement Learning from Human Feedback (RLHF) has achieved impressive empirical successes while relying on a small amount of human feedback. However, there is limited theoretical justification for this phenomenon. Additionally, most recent studies focus on value-based algorithms despite the recent empirical successes of policy-based algorithms. In this work, we consider an RLHF algorithm based on policy optimization (PO-RLHF). The algorithm is based on the popular Policy Cover-Policy Gradient (PC-PG) algorithm, which assumes knowledge of the reward function. In PO-RLHF, knowledge of the reward function is not assumed, and the algorithm uses trajectory-based comparison feedback to infer the reward function. We provide performance bounds for PO-RLHF with low query complexity, which provides insight into why a small amount of human feedback may be sufficient to achieve good performance with RLHF. A key novelty is a trajectory-level elliptical potential analysis, which bounds the rewar
Authors
(none)
Tags
Stats
Related papers
- Can RLHF Be More Efficient With Imperfect Reward Models? A Policy Coverage Perspective (2025)0.00
- Provably Efficient Exploration In Policy Optimization (2019)0.00
- Data-dependent Exploration For Online Reinforcement Learning From Human Feedback (2026)0.00
- Policy Filtration For RLHF To Mitigate Noise In Reward Models (2024)0.00
- Reward Model Learning Vs. Direct Policy Optimization: A Comparative Analysis Of Learning From Human Preferences (2024)0.00
- Think Outside The Policy: In-context Steered Policy Optimization (2025)0.00
- Conservative Exploration For Policy Optimization Via Off-policy Policy Evaluation (2023)0.00
- Zeroth-order Policy Gradient For Reinforcement Learning From Human Feedback Without Reward Inference (2024)0.00