Abstract

Distributional reinforcement learning (DRL) has achieved empirical success in various domains. One core task in DRL is distributional policy evaluation, which involves estimating the return distribution \(\eta^\pi\) for a given policy \(\pi\). Distributional temporal difference learning (distributional TD) has been proposed for this task as an extension of classic temporal difference learning (TD) in RL. In this paper, we focus on the non-asymptotic statistical rates of distributional TD. To facilitate theoretical analysis, we propose non-parametric distributional TD (NTD). For a \(\gamma\)-discounted infinite-horizon tabular Markov decision process, we show that NTD with a generative model needs \(\tilde{O}(\epsilon^{-2}\mu_{\min}^{-1}(1-\gamma)^{-3})\) interactions with the environment to achieve an \(\epsilon\)-optimal estimator with high probability, when the estimation error is measured by the \(1\)-Wasserstein distance. This sample complexity bound is minimax optimal up to logarithmic factors.
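
To give a feel for how the stated bound scales, the following is a minimal sketch that evaluates only the leading term \(\epsilon^{-2}\mu_{\min}^{-1}(1-\gamma)^{-3}\). The function name `ntd_sample_complexity_scale` is hypothetical and not from the paper. The constants and logarithmic factors hidden by \(\tilde{O}\) are dropped, so the returned value is not an actual sample count; only ratios between two settings are meaningful. The input checks (positive \(\epsilon\) and \(\mu_{\min}\), \(\gamma\) in \([0, 1)\)) are assumptions made for this illustration.

```python
def ntd_sample_complexity_scale(epsilon: float, mu_min: float, gamma: float) -> float:
    """Leading-order scale epsilon^-2 * mu_min^-1 * (1 - gamma)^-3.

    Hypothetical helper for illustration only. Constants and logarithmic
    factors hidden by the tilde-O notation are NOT included, so compare
    ratios between settings rather than reading off absolute numbers.
    """
    # Assumed valid ranges for this sketch (not stated in the abstract).
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if mu_min <= 0:
        raise ValueError("mu_min must be positive")
    if not (0 <= gamma < 1):
        raise ValueError("gamma must lie in [0, 1)")
    return 1.0 / (epsilon ** 2 * mu_min * (1.0 - gamma) ** 3)


# Compare settings by ratio, since the absolute scale is not meaningful.
base = ntd_sample_complexity_scale(epsilon=0.1, mu_min=0.05, gamma=0.9)
tighter = ntd_sample_complexity_scale(epsilon=0.05, mu_min=0.05, gamma=0.9)
longer_horizon = ntd_sample_complexity_scale(epsilon=0.1, mu_min=0.05, gamma=0.99)

print("halving epsilon multiplies the scale by", tighter / base)
print("moving gamma from 0.9 to 0.99 multiplies the scale by", longer_horizon / base)
```

Under this leading term, halving the target error quadruples the scale, while shrinking \(1-\gamma\) by a factor of ten raises it by a factor of a thousand, so the effective horizon dominates the cost.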

