Abstract

In this paper, we investigate the sample complexity of policy evaluation in infinite-horizon offline reinforcement learning (also known as the off-policy evaluation problem) with linear function approximation. We identify a hard regime \(d\gamma^\{2\}>1\), where \(d\) is the dimension of the feature vector and \(\gamma\) is the discount rate. In this regime, for any \(q\in[\gamma^\{2\},1]\), we can construct a hard instance such that the smallest eigenvalue of its feature covariance matrix is \(q/d\) and it requires \(Ξ©\left(\frac\{d\}\{\gamma^\{2\}\left(q-\gamma^\{2\}\right)\epsilon^\{2\}\}\exp\left(\Theta\left(d\gamma^\{2\}\right)\right)\right)\) samples to approximate the value function up to an additive error \(\epsilon\). Note that the lower bound of the sample complexity is exponential in \(d\). If \(q=\gamma^\{2\}\), even infinite data cannot suffice. Under the low distribution shift assumption, we show that there is an algorithm that needs at most \(O\left(\max\left\\{ \frac\{\

Authors

(none)

Tags

  • Offline RL

Stats

  • citations0
  • S2 citationsβ€”
  • github stars0
  • HF likes0
  • heat score0.00
  • arxiv keychen2021infinite

Related papers