Query-policy Misalignment In Preference-based Reinforcement Learning
2023 · Xiao Hu, Jianxiong Li, Xianyuan Zhan, et al.
Abstract
Preference-based reinforcement learning (PbRL) provides a natural way to align RL agents' behavior with human-desired outcomes, but is often constrained by costly human feedback. To improve feedback efficiency, most existing PbRL methods focus on selecting queries that maximally improve the overall quality of the reward model, but counter-intuitively, we find that this does not necessarily lead to improved performance. To unravel this mystery, we identify a long-neglected issue in the query selection schemes of existing PbRL studies: Query-Policy Misalignment. We show that the seemingly informative queries selected to improve the overall quality of the reward model may not actually align with RL agents' interests, thus offering little help for policy learning and eventually resulting in poor feedback efficiency. We show that this issue can be effectively addressed via near on-policy query and a specially designed hybrid experience replay, which together enforce the bidirectional query-policy al
Related papers
- Efficient Preference-based Reinforcement Learning Via Aligned Experience Estimation (2024)
- RA-PbRL: Provably Efficient Risk-aware Preference-based Reinforcement Learning (2024)
- Hindsight Priors For Reward Learning From Human Preferences (2024)
- Symbol Guided Hindsight Priors For Reward Learning From Human Preferences (2022)
- Exploration-driven Policy Optimization In RLHF: Theoretical Insights On Efficient Data Utilization (2024)
- Policy Improvement Reinforcement Learning (2026)
- Preference-based Multi-agent Reinforcement Learning: Data Coverage And Algorithmic Techniques (2024)
- Dueling RL: Reinforcement Learning With Trajectory Preferences (2021)