Zeroth-order Policy Gradient For Reinforcement Learning From Human Feedback Without Reward Inference
2024 · Qining Zhang, Lei Ying
Abstract
Reward inference (learning a reward model from human preferences) is a critical intermediate step in the Reinforcement Learning from Human Feedback (RLHF) pipeline for fine-tuning Large Language Models (LLMs). In practice, RLHF faces fundamental challenges such as distribution shift, reward model overfitting, and problem misspecification. An alternative approach is direct policy optimization without reward inference, such as Direct Preference Optimization (DPO), which provides a much simpler pipeline and has shown empirical success in LLM applications. However, DPO utilizes the closed-form expression between the optimal policy and the reward function, which is only suitable under the bandit setting or deterministic MDPs. This paper develops two RLHF algorithms without reward inference for general RL problems beyond bandits and deterministic MDPs, and general preference models beyond the Bradley-Terry model. The key idea is to estimate the local value function difference from human pref
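The title and abstract point to two ingredients: estimating a local value difference from preference feedback, and using that estimate in a zeroth-order (derivative-free) policy gradient step. The following is a minimal sketch of how these could fit together, not the authors' algorithm. Everything named here is an assumption made for illustration: the two-dimensional parameter vector, the quadratic `toy_value` function, the simulated `preference_oracle` (which applies a Bradley-Terry style logistic link directly to the value difference between two policies), and all step sizes.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    # Inverse of the logistic link: maps a preference probability to a value difference.
    return math.log(p / (1.0 - p))


def random_unit_vector(dim, rng):
    # Uniform direction on the unit sphere, obtained by normalizing a Gaussian sample.
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in g))
    return [x / norm for x in g]


def toy_value(theta):
    # Hypothetical stand-in for the policy value; the learner never calls this directly.
    target = [1.0, -2.0]
    return -sum((t - c) ** 2 for t, c in zip(theta, target))


def preference_oracle(theta_a, theta_b, rng):
    # Simulated human (assumption): prefers policy b over policy a with
    # probability sigmoid(V(b) - V(a)). Returns 1 if b is preferred, else 0.
    p = sigmoid(toy_value(theta_b) - toy_value(theta_a))
    return 1 if rng.random() < p else 0


def zeroth_order_gradient(theta, mu, n_queries, rng):
    dim = len(theta)
    v = random_unit_vector(dim, rng)
    perturbed = [t + mu * vi for t, vi in zip(theta, v)]
    # Estimate the preference probability from repeated comparisons.
    wins = sum(preference_oracle(theta, perturbed, rng) for _ in range(n_queries))
    eps = 1.0 / (2 * n_queries)
    p_hat = min(max(wins / n_queries, eps), 1.0 - eps)  # keep logit finite
    # Invert the link to recover an estimate of V(theta + mu * v) - V(theta).
    value_diff = logit(p_hat)
    # Standard sphere-sampling zeroth-order gradient estimate.
    return [(dim / mu) * value_diff * vi for vi in v]


def train(steps=500, lr=0.02, mu=0.5, n_queries=200, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0]
    for _ in range(steps):
        grad = zeroth_order_gradient(theta, mu, n_queries, rng)
        theta = [t + lr * g for t, g in zip(theta, grad)]  # gradient ascent
    return theta


if __name__ == "__main__":
    print(train())  # should land near the hypothetical target [1.0, -2.0]
```

The learner in this sketch only ever sees binary comparisons; no reward model is fitted. A smaller `mu` reduces smoothing but amplifies the noise in the estimated value difference, so more comparisons per step would be needed.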
Related papers
- Reward Model Learning Vs. Direct Policy Optimization: A Comparative Analysis Of Learning From Human Preferences (2024)
- Exploration-driven Policy Optimization In RLHF: Theoretical Insights On Efficient Data Utilization (2024)
- A General Theoretical Paradigm To Understand Learning From Human Preferences (2023)
- RLZero: Direct Policy Inference From Language Without In-domain Supervision (2024)
- Understanding The Performance Gap In Preference Learning: A Dichotomy Of RLHF And DPO (2025)
- Using Human Feedback To Fine-tune Diffusion Models Without Any Reward Model (2023)
- Policy Filtration For RLHF To Mitigate Noise In Reward Models (2024)
- Zeroth-order Deterministic Policy Gradient (2020)