Reinforcement Learning With Human Feedback: Learning Dynamic Choices Via Pessimism
2023 · Zihao Li, Zhuoran Yang, Mengdi Wang
Abstract
We study offline Reinforcement Learning with Human Feedback (RLHF), where the aim is to learn the human's underlying reward and the MDP's optimal policy from a set of trajectories induced by human choices. RLHF is challenging for several reasons: a large state space with limited human feedback, the bounded rationality of human decisions, and off-policy distribution shift. We focus on the Dynamic Discrete Choice (DDC) model for modeling and understanding human choices. DDC, rooted in econometrics and decision theory, is widely used to model human decision-making that is forward-looking and boundedly rational. We propose a Dynamic-Choice-Pessimistic-Policy-Optimization (DCPPO) method. The method involves a three-stage process: the first step is to estimate the human behavior policy and the state-action value function via maximum likelihood estimation (MLE); the second step recovers
Related papers
- Model-based Offline Reinforcement Learning With Pessimism-modulated Dynamics Belief (2022)
- Pessimism In The Face Of Confounders: Provably Efficient Offline Reinforcement Learning In Partially Observable Markov Decision Processes (2022)
- Revisiting Design Choices In Offline Model-based Reinforcement Learning (2021)
- Principled Reinforcement Learning With Human Feedback From Pairwise Or \(k\)-wise Comparisons (2023)
- Data-dependent Exploration For Online Reinforcement Learning From Human Feedback (2026)
- Exploration-driven Policy Optimization In RLHF: Theoretical Insights On Efficient Data Utilization (2024)
- Understanding The Performance Gap In Preference Learning: A Dichotomy Of RLHF And DPO (2025)
- Is Pessimism Provably Efficient For Offline RL? (2020)