Infinite-horizon Offline Reinforcement Learning With Linear Function Approximation: Curse Of Dimensionality And Algorithm

Abstract

In this paper, we investigate the sample complexity of policy evaluation in infinite-horizon offline reinforcement learning (also known as the off-policy evaluation problem) with linear function approximation. We identify a hard regime \(d\gamma^\{2\}>1\), where \(d\) is the dimension of the feature vector and \(\gamma\) is the discount rate. In this regime, for any \(q\in[\gamma^\{2\},1]\), we can construct a hard instance such that the smallest eigenvalue of its feature covariance matrix is \(q/d\) and it requires \(Ω\left(\frac\{d\}\{\gamma^\{2\}\left(q-\gamma^\{2\}\right)\epsilon^\{2\}\}\exp\left(\Theta\left(d\gamma^\{2\}\right)\right)\right)\) samples to approximate the value function up to an additive error \(\epsilon\). Note that the lower bound of the sample complexity is exponential in \(d\). If \(q=\gamma^\{2\}\), even infinite data cannot suffice. Under the low distribution shift assumption, we show that there is an algorithm that needs at most \(O\left(\max\left\\{ \frac\{\

Infinite-horizon Offline Reinforcement Learning With Linear Function Approximation: Curse Of Dimensionality And Algorithm

Abstract

Authors

Tags

Stats

Related papers