Dueling Posterior Sampling For Preference-based Reinforcement Learning
2019 · Ellen R. Novoseller, Yibing Wei, Yanan Sui, et al.
Abstract
In preference-based reinforcement learning (RL), an agent interacts with the environment while receiving preferences instead of absolute feedback. While there is increasing research activity in preference-based RL, the design of formal frameworks that admit tractable theoretical analysis remains an open challenge. Building upon ideas from preference-based bandit learning and posterior sampling in RL, we present DUELING POSTERIOR SAMPLING (DPS), which employs preference-based posterior sampling to learn both the system dynamics and the underlying utility function that governs the preference feedback. As preference feedback is provided on trajectories rather than individual state-action pairs, we develop a Bayesian approach for the credit assignment problem, translating preferences to a posterior distribution over state-action reward models. We prove an asymptotic Bayesian no-regret rate for DPS with a Bayesian linear regression credit assignment model. This is the first regret guarantee
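The credit assignment step described in the abstract can be illustrated with a minimal sketch. It assumes a small tabular problem, trajectories summarized by state-action visit counts, a preference encoded as a label of +0.5 or -0.5 on the difference of two visit-count vectors, and a Gaussian prior with unit precision. The function names (`visit_counts`, `reward_posterior`, `sample_rewards`), the label encoding, and the hyperparameters are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def visit_counts(trajectory, n_states, n_actions):
    """Count how often each (state, action) pair occurs in a trajectory."""
    x = np.zeros(n_states * n_actions)
    for s, a in trajectory:
        x[s * n_actions + a] += 1.0
    return x


def reward_posterior(diffs, labels, prior_precision=1.0, noise_var=1.0):
    """Gaussian posterior over per-(state, action) rewards.

    diffs:  rows are visit-count differences x_A - x_B for compared pairs.
    labels: +0.5 if A was preferred, -0.5 otherwise (assumed encoding).
    """
    X = np.asarray(diffs, dtype=float)
    y = np.asarray(labels, dtype=float)
    d = X.shape[1]
    # Standard Bayesian linear regression with a zero-mean isotropic prior.
    precision = X.T @ X / noise_var + prior_precision * np.eye(d)
    cov = np.linalg.inv(precision)
    mean = cov @ (X.T @ y) / noise_var
    return mean, cov


def sample_rewards(mean, cov, rng):
    """Draw one reward vector from the posterior (posterior sampling step)."""
    return rng.multivariate_normal(mean, cov)


# Toy example: 2 states, 2 actions, one observed preference (A over B).
traj_a = [(0, 1), (1, 1)]
traj_b = [(0, 0), (1, 0)]
x_a = visit_counts(traj_a, n_states=2, n_actions=2)
x_b = visit_counts(traj_b, n_states=2, n_actions=2)

mean, cov = reward_posterior([x_a - x_b], [0.5])
sampled = sample_rewards(mean, cov, np.random.default_rng(0))
print("posterior mean utility of A:", mean @ x_a)
print("posterior mean utility of B:", mean @ x_b)
```

In this sketch a single preference for trajectory A shifts the posterior mean so that A's total utility exceeds B's, while the sampled reward vector still reflects the remaining uncertainty. A full posterior-sampling agent would also sample a dynamics model and plan against both samples, which is omitted here.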
Related papers
- Dueling RL: Reinforcement Learning With Trajectory Preferences (2021)
- Model-based Reinforcement Learning For Continuous Control With Posterior Sampling (2020)
- Posterior Sampling For Large Scale Reinforcement Learning (2017)
- Posterior Sampling For Continuing Environments (2022)
- Posterior Sampling With Delayed Feedback For Reinforcement Learning With Linear Function Approximation (2023)
- Why Is Posterior Sampling Better Than Optimism For Reinforcement Learning? (2016)
- Reinforcement Learning From Diverse Human Preferences (2023)
- Hindsight Priors For Reward Learning From Human Preferences (2024)