Abstract

In this paper, we study reinforcement learning from human feedback (RLHF) under an episodic Markov decision process with a general trajectory-wise reward model. We develop a model-free RLHF best-policy-identification algorithm, called \(\mathsf{BSAD}\), that requires no explicit reward model inference, a critical intermediate step in contemporary RLHF paradigms for training large language models (LLMs). The algorithm identifies the optimal policy directly from human preference information in a backward manner, employing a dueling bandit sub-routine that repeatedly duels actions to identify the superior one. \(\mathsf{BSAD}\) adopts reward-free exploration and best-arm-identification-like adaptive stopping criteria to equalize the visitation among all states in the same decision step while moving to the previous step as soon as the optimal action is identifiable, leading to a provable, instance-dependent sample complexity \(\tilde{\mathcal{O}}(c_{\mathcal{M}}SA^3H^3Ml

Authors

(none)

Tags

  • Model-Based RL
