Data-dependent Exploration For Online Reinforcement Learning From Human Feedback
2026 · Zhen-Yu Zhang, Yuting Tang, Jiandong Zhang, et al.
Abstract
arXiv:2605.04477v1

Online reinforcement learning from human feedback (RLHF) has emerged as a promising paradigm for aligning large language models (LLMs) by continuously collecting new preference feedback during training. A foundational challenge in this setting is exploration: the algorithm must lead the LLM to generate informative comparisons that improve the sample efficiency of online RLHF. Existing exploration strategies often derive bonuses via on-policy expectations, which are difficult to estimate reliably from the limited historical preference data available during training; as a result, the policy can prematurely down-weight under-explored regions that may contain high-value behaviors. In this paper, we propose data-dependent exploration for preference optimization (DEPO), a simple and scalable method that leverages historical data to construct an extra uncertainty bonus for high-uncertainty regions, encouraging exploration toward pote
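The idea of an uncertainty bonus built from historical data can be illustrated with a toy sketch. This is not DEPO itself: the abstract does not specify the bonus, so the example swaps in a plain count-based bonus, and the names `uncertainty_bonus`, `pick_comparison`, the `scale` parameter, and the notion of a discrete "region" label are all assumptions made for illustration.

```python
import math
from collections import Counter


def uncertainty_bonus(history, region, scale=1.0):
    """Hypothetical count-based bonus (an assumption, not the paper's formula).

    `history` is a list of region labels for past preference comparisons.
    The bonus shrinks as a region accumulates more historical data.
    """
    counts = Counter(history)
    return scale / math.sqrt(1 + counts[region])


def pick_comparison(history, candidates, scale=1.0):
    """Pick the (region, estimated_reward) pair with the highest reward plus bonus."""
    return max(
        candidates,
        key=lambda c: c[1] + uncertainty_bonus(history, c[0], scale),
    )


# Toy data: "math" is well covered, "poetry" has never been compared.
history = ["math", "math", "math", "code"]
candidates = [("math", 0.6), ("code", 0.5), ("poetry", 0.3)]
chosen = pick_comparison(history, candidates)
```

In this toy setup the unseen region receives the largest bonus, so it can be selected even though its estimated reward is the lowest. That mirrors the abstract's stated goal of not down-weighting under-explored regions too early.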
Related papers
- Exploration-driven Policy Optimization In RLHF: Theoretical Insights On Efficient Data Utilization (2024) 0.00
- Diverse Exploration For Fast And Safe Policy Improvement (2018) 4.52
- Wasserstein Distributionally Robust Regret Optimization For Reinforcement Learning From Human Feedback (2026) 0.00
- Improving Policy Gradient By Exploring Under-appreciated Rewards (2016) 0.00
- Offline Retraining For Online RL: Decoupled Policy Learning To Mitigate Exploration Bias (2023) 2.56
- Learning Off-policy With Model-based Intrinsic Motivation For Active Online Exploration (2024) 0.00
- Decoupled Exploration And Exploitation Policies For Sample-efficient Reinforcement Learning (2021) 0.00
- Reinforcement Learning With Human Feedback: Learning Dynamic Choices Via Pessimism (2023) 0.00