Statistical Efficiency Of Distributional Temporal Difference Learning And Freedman's Inequality In Hilbert Spaces

Abstract

Distributional reinforcement learning (DRL) has achieved empirical success in various domains. One core task in DRL is distributional policy evaluation, which involves estimating the return distribution \(\eta^\pi\) for a given policy \(\pi\). Distributional temporal difference learning (distributional TD) has been proposed for this task, extending classic temporal difference learning (TD) in RL. In this paper, we focus on the non-asymptotic statistical rates of distributional TD. To facilitate theoretical analysis, we propose non-parametric distributional TD (NTD). For a \(\gamma\)-discounted infinite-horizon tabular Markov decision process, we show that NTD with a generative model needs \(\tilde{O}(\epsilon^{-2}\mu_{\min}^{-1}(1-\gamma)^{-3})\) interactions with the environment to achieve an \(\epsilon\)-optimal estimator with high probability, when the estimation error is measured in the \(1\)-Wasserstein metric. This sample complexity bound is minimax optimal up to logarithmic factors.
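To get a feel for how the stated bound scales, the following minimal sketch evaluates only the leading-order term \(\epsilon^{-2}\mu_{\min}^{-1}(1-\gamma)^{-3}\). It is an illustration, not code from the paper: the function name `ntd_sample_scaling` is hypothetical, constants and logarithmic factors hidden by \(\tilde{O}\) are dropped, and \(\mu_{\min}\) is treated as a given parameter in \((0, 1]\) because the abstract does not define it.

```python
def ntd_sample_scaling(epsilon: float, mu_min: float, gamma: float) -> float:
    """Leading-order term eps^-2 * mu_min^-1 * (1 - gamma)^-3.

    Hypothetical helper for illustration only. It ignores the constants
    and logarithmic factors hidden by the O-tilde notation, so the value
    is a scaling indicator, not an actual sample count.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if not 0 < mu_min <= 1:
        # Assumption: mu_min lies in (0, 1]; the abstract does not define it.
        raise ValueError("mu_min is assumed to lie in (0, 1]")
    if not 0 <= gamma < 1:
        raise ValueError("gamma must lie in [0, 1)")
    return 1.0 / (epsilon ** 2 * mu_min * (1.0 - gamma) ** 3)


if __name__ == "__main__":
    base = ntd_sample_scaling(epsilon=0.1, mu_min=0.5, gamma=0.9)
    tighter = ntd_sample_scaling(epsilon=0.05, mu_min=0.5, gamma=0.9)
    longer = ntd_sample_scaling(epsilon=0.1, mu_min=0.5, gamma=0.95)
    print(f"baseline scaling:          {base:.3e}")
    print(f"halving epsilon:           x{tighter / base:.2f}")
    print(f"halving (1 - gamma):       x{longer / base:.2f}")
```

Under these assumptions, halving \(\epsilon\) multiplies the term by about 4, while halving the effective horizon gap \(1-\gamma\) multiplies it by about 8, which is why the dependence on \((1-\gamma)^{-3}\) dominates for long horizons.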
