Efficient Preference-based Reinforcement Learning Via Aligned Experience Estimation
2024 · Fengshuo Bai, Rui Zhao, Hongming Zhang, et al.
Abstract
Preference-based reinforcement learning (PbRL) has shown impressive capabilities in training agents without reward engineering. However, a notable limitation of PbRL is its dependency on substantial human feedback. This dependency stems from the learning loop, which entails accurate reward learning compounded with value/policy learning, necessitating a considerable number of samples. To boost the learning loop, we propose SEER, an efficient PbRL method that integrates label smoothing and policy regularization techniques. Label smoothing reduces overfitting of the reward model by smoothing human preference labels. Additionally, we bootstrap a conservative estimate \(\widehat{Q}\) using well-supported state-action pairs from the current replay memory to mitigate overestimation bias and utilize it for policy learning regularization. Our experimental results across a variety of complex tasks, both in online and offline settings, demonstrate that our approach improves feedback efficiency,
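The label-smoothing step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything below is an assumption: the function names are hypothetical, the smoothing rule is the standard two-class form (1 - eps) * y + eps / 2, and the preference predictor is a Bradley-Terry model over summed segment rewards, which is a common choice in PbRL but is not confirmed by this page.

```python
import math

def smooth_label(y, eps=0.1):
    """Pull a hard preference label y in {0, 1} toward 0.5.

    Assumed form: standard two-class label smoothing.
    """
    return (1.0 - eps) * y + eps * 0.5

def preference_prob(rewards_a, rewards_b):
    """Probability that segment A is preferred over segment B.

    Assumed Bradley-Terry model on the summed predicted rewards.
    """
    diff = sum(rewards_a) - sum(rewards_b)
    return 1.0 / (1.0 + math.exp(-diff))

def preference_loss(rewards_a, rewards_b, y, eps=0.1):
    """Cross-entropy between the smoothed label and the predicted preference."""
    p = preference_prob(rewards_a, rewards_b)
    # Clamp to keep the logarithms finite for very confident predictions.
    p = min(max(p, 1e-12), 1.0 - 1e-12)
    t = smooth_label(y, eps)
    return -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))

# Hypothetical predicted per-step rewards for two trajectory segments.
seg_a = [0.4, 0.1, 0.3]
seg_b = [0.2, 0.0, 0.1]
loss_hard = preference_loss(seg_a, seg_b, y=1, eps=0.0)
loss_smooth = preference_loss(seg_a, seg_b, y=1, eps=0.1)
```

With a smoothed target, the loss is no longer minimized by pushing the predicted preference to exactly 0 or 1, which is one way to read the abstract's claim that smoothing reduces overfitting of the reward model.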
Related papers
- Listwise Reward Estimation For Offline Preference-based Reinforcement Learning (2024)
- Query-policy Misalignment In Preference-based Reinforcement Learning (2023)
- Data Driven Reward Initialization For Preference Based Reinforcement Learning (2023)
- Hindsight Priors For Reward Learning From Human Preferences (2024)
- Symbol Guided Hindsight Priors For Reward Learning From Human Preferences (2022)
- RA-PbRL: Provably Efficient Risk-aware Preference-based Reinforcement Learning (2024)
- Preference-based Multi-agent Reinforcement Learning: Data Coverage And Algorithmic Techniques (2024)
- Tell Me Why: Training Preferences-based RL With Human Preferences And Step-level Explanations (2024)