Hindsight Priors For Reward Learning From Human Preferences
2024 · Mudit Verma, Katherine Metcalf
Abstract
Preference-based Reinforcement Learning (PbRL) removes the need to hand-specify a reward function by learning a reward from preference feedback over policy behaviors. Current approaches to PbRL do not address the credit assignment problem inherent in determining which parts of a behavior most contributed to a preference, which results in data-intensive approaches and subpar reward functions. We address these limitations by introducing a credit assignment strategy (Hindsight PRIOR) that uses a world model to approximate state importance within a trajectory and then guides rewards to be proportional to state importance through an auxiliary predicted return redistribution objective. Incorporating state importance into reward learning improves the speed of policy learning, overall policy performance, and reward recovery on both locomotion and manipulation tasks. For example, Hindsight PRIOR recovers on average significantly (p<0.05) more reward on MetaWorld (20%) and DMC (15%). The performan
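The objective described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a standard Bradley-Terry preference loss over summed predicted rewards, takes the per-state importance weights as given inputs (in the paper they are approximated with a world model), and uses hypothetical names throughout (`preference_loss`, `redistribution_loss`, `total_loss`, and the weighting coefficient `lam`).

```python
import numpy as np

def preference_loss(r_a, r_b, pref):
    """Bradley-Terry cross-entropy over two segments' predicted rewards.

    pref is 1.0 if segment A is preferred, 0.0 if segment B is preferred.
    """
    ret_a, ret_b = r_a.sum(), r_b.sum()
    # log P(A preferred) = ret_a - log(exp(ret_a) + exp(ret_b))
    log_z = np.logaddexp(ret_a, ret_b)
    return -(pref * (ret_a - log_z) + (1.0 - pref) * (ret_b - log_z))

def redistribution_loss(r_hat, importance):
    """Mean squared gap between each predicted reward and its
    importance-weighted share of the segment's predicted return.

    `importance` holds non-negative per-state scores (assumed given here).
    """
    weights = importance / importance.sum()
    # In a trained model this target would typically be held fixed
    # (no gradient); that choice is an assumption of this sketch.
    target = weights * r_hat.sum()
    return np.mean((r_hat - target) ** 2)

def total_loss(r_a, r_b, pref, imp_a, imp_b, lam=0.5):
    """Preference loss plus the auxiliary redistribution term (lam is hypothetical)."""
    aux = redistribution_loss(r_a, imp_a) + redistribution_loss(r_b, imp_b)
    return preference_loss(r_a, r_b, pref) + lam * aux
```

When the predicted rewards of a segment are already proportional to its importance scores, the auxiliary term vanishes and only the preference term remains; any mismatch adds a penalty scaled by `lam`.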
Related papers
- Symbol Guided Hindsight Priors For Reward Learning From Human Preferences (2022)
- Data Driven Reward Initialization For Preference Based Reinforcement Learning (2023)
- Ra-pbrl: Provably Efficient Risk-aware Preference-based Reinforcement Learning (2024)
- Listwise Reward Estimation For Offline Preference-based Reinforcement Learning (2024)
- Dueling RL: Reinforcement Learning With Trajectory Preferences (2021)
- Reinforcement Learning From Diverse Human Preferences (2023)
- Efficient Preference-based Reinforcement Learning Via Aligned Experience Estimation (2024)
- Query-policy Misalignment In Preference-based Reinforcement Learning (2023)