Abstract
We consider the problem of learning an \(\epsilon\)-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators. Given access to a generative model, we achieve rate-optimal sample complexity by performing a simple, *perturbed* version of least-squares value iteration with orthogonal trigonometric polynomials as features. Key to our solution is a novel projection technique based on ideas from harmonic analysis. Our~\(\widetilde\{\mathcal\{O\}\}(\epsilon^\{-2-d/(\nu+1)\})\) sample complexity, where \(d\) is the dimension of the state-action space and \(\nu\) the order of smoothness, recovers the state-of-the-art result of discretization approaches for the special case of Lipschitz MDPs \((\nu=0)\). At the same time, for \(\nu\to\infty\), it recovers and greatly generalizes the \(\mathcal\{O\}(\epsilon^\{-2\})\) rate of low-rank MDPs, which are more amenable to regression approaches. In this sense, our result bridges the gap betwe