Abstract

We study offline reinforcement learning (RL) with linear MDPs under the infinite-horizon discounted setting which aims to learn a policy that maximizes the expected discounted cumulative reward using a pre-collected dataset. Existing algorithms for this setting either require a uniform data coverage assumptions or are computationally inefficient for finding an \(\epsilon\)-optimal policy with \(O(\epsilon^\{-2\})\) sample complexity. In this paper, we propose a primal dual algorithm for offline RL with linear MDPs in the infinite-horizon discounted setting. Our algorithm is the first computationally efficient algorithm in this setting that achieves sample complexity of \(O(\epsilon^\{-2\})\) with partial data coverage assumption. Our work is an improvement upon a recent work that requires \(O(\epsilon^\{-4\})\) samples. Moreover, we extend our algorithm to work in the offline constrained RL setting that enforces constraints on additional reward signals.

Authors

(none)

Tags

  • Offline RL

Stats

  • citations0
  • S2 citationsβ€”
  • github stars0
  • HF likes0
  • heat score0.00
  • arxiv keyhong2024a

Related papers