Dataset Reset Policy Optimization For RLHF
2024 · Jonathan D. Chang, Wenhao Zhan, Owen Oertell, et al.
Abstract
Reinforcement Learning (RL) from Human Preference-based feedback is a popular paradigm for fine-tuning generative models, which has produced impressive models such as GPT-4 and Claude3 Opus. This framework often consists of two steps: learning a reward model from an offline preference dataset, followed by running online RL to optimize the learned reward model. In this work, leveraging the idea of reset, we propose a new RLHF algorithm with provable guarantees. Motivated by the fact that the offline preference dataset provides informative states (i.e., data that is preferred by the labelers), our new algorithm, Dataset Reset Policy Optimization (DR-PO), integrates the existing offline preference dataset into the online policy training procedure via dataset reset: it directly resets the policy optimizer to the states in the offline dataset, instead of always starting from the initial state distribution. In theory, we show that DR-PO learns to perform at least as well as any policy that is cov
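The dataset-reset idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a text-generation setting in which a "state" is a prompt followed by a partial response, so resetting to a dataset state means starting a rollout from a prefix of a preferred response. The function name `sample_start_state` and the `reset_prob` mixing parameter are hypothetical and introduced only for illustration.

```python
import random


def sample_start_state(prompt, preferred_tokens, reset_prob, rng):
    """Pick the token sequence from which an online rollout would begin.

    Hypothetical helper: with probability `reset_prob` the rollout starts
    from a state taken from the offline dataset, otherwise it starts from
    the initial state (the bare prompt).
    """
    if rng.random() < reset_prob:
        # Dataset reset (assumed form): keep a random-length prefix of the
        # labeler-preferred response and let the policy continue from there.
        cut = rng.randint(0, len(preferred_tokens))  # both bounds inclusive
        return prompt + preferred_tokens[:cut]
    # Standard online RL: always start from the initial state distribution.
    return list(prompt)


rng = random.Random(0)
prompt = ["Summarize", ":"]
preferred = ["The", "cat", "sat", "."]

reset_start = sample_start_state(prompt, preferred, reset_prob=1.0, rng=rng)
plain_start = sample_start_state(prompt, preferred, reset_prob=0.0, rng=rng)
```

Here `reset_start` always begins with the prompt and may include some or all of the preferred response, while `plain_start` is the prompt alone. In a full training loop the policy would generate the remainder of the response from the returned state and be scored by the learned reward model.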
Related papers
- Exploration-driven Policy Optimization In RLHF: Theoretical Insights On Efficient Data Utilization (2024)
- POPO: Pessimistic Offline Policy Optimization (2020)
- Robust Offline Reinforcement Learning With Gradient Penalty And Constraint Relaxation (2022)
- Federated Offline Policy Optimization With Dual Regularization (2024)
- Policy Filtration For RLHF To Mitigate Noise In Reward Models (2024)
- PROTO: Iterative Policy Regularized Offline-to-online Reinforcement Learning (2023)
- Decoupled Prioritized Resampling For Offline RL (2023)
- Offline Retraining For Online RL: Decoupled Policy Learning To Mitigate Exploration Bias (2023)