Abstract

Two central paradigms have emerged in the reinforcement learning (RL) community: online RL and offline RL. In the online RL setting, the agent has no prior knowledge of the environment and must interact with it to find an \(\epsilon\)-optimal policy. In the offline RL setting, the learner instead has access to a fixed dataset but cannot otherwise interact with the environment, and must obtain the best policy it can from this offline data. Practical scenarios often motivate an intermediate setting: if we have some offline data and may also interact with the environment, how can we best use the offline data to minimize the number of online interactions needed to learn an \(\epsilon\)-optimal policy? In this work, we consider this setting, which we call the \textsf{FineTuneRL} setting, for MDPs with linear structure. We characterize the number of online samples needed in this setting given access to some offline dataset,

Tags

  • Offline RL
