UVIP: Model-free Approach To Evaluate Reinforcement Learning Algorithms

2021


Abstract

Policy evaluation is an important instrument for comparing different algorithms in Reinforcement Learning (RL). However, even precise knowledge of the value function \(V^{\pi}\) corresponding to a policy \(\pi\) does not provide reliable information on how far \(\pi\) is from the optimal policy. We present a novel model-free upper value iteration procedure (UVIP) that allows us to estimate the suboptimality gap \(V^{\star}(x) - V^{\pi}(x)\) from above and to construct confidence intervals for \(V^{\star}\). Our approach relies on upper bounds on the solution of the Bellman optimality equation obtained via the martingale approach. We provide theoretical guarantees for UVIP under general assumptions and illustrate its performance on a number of benchmark RL problems.
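To make the quantity \(V^{\star}(x) - V^{\pi}(x)\) concrete, the following is a minimal sketch that computes the gap exactly on a toy tabular MDP. It is not UVIP: it assumes the transition model is known, which is precisely what a model-free procedure avoids. The MDP (`P`, `R`, `GAMMA`) and all function names are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical toy MDP (not from the paper): 2 states, 2 actions.
# P[a, x, y] = probability of moving from state x to state y under action a.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],  # action 0
    [[0.1, 0.9], [0.7, 0.3]],  # action 1
])
# R[x, a] = expected immediate reward in state x under action a.
R = np.array([[1.0, 0.0], [0.0, 2.0]])
GAMMA = 0.9  # discount factor (assumed)


def policy_value(policy):
    """Exact V^pi for a deterministic policy: solve (I - gamma P_pi) V = r_pi."""
    n = R.shape[0]
    P_pi = np.array([P[policy[x], x] for x in range(n)])
    r_pi = np.array([R[x, policy[x]] for x in range(n)])
    return np.linalg.solve(np.eye(n) - GAMMA * P_pi, r_pi)


def optimal_value(tol=1e-10):
    """Approximate V* by iterating the Bellman optimality operator."""
    V = np.zeros(R.shape[0])
    while True:
        # Q[x, a] = R[x, a] + gamma * sum_y P[a, x, y] * V[y]
        Q = R + GAMMA * np.einsum("axy,y->xa", P, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new


def suboptimality_gap(policy):
    """Per-state gap V*(x) - V^pi(x); nonnegative up to numerical tolerance."""
    return optimal_value() - policy_value(policy)


if __name__ == "__main__":
    print(suboptimality_gap([0, 0]))
```

When the model is unknown, \(V^{\star}\) cannot be computed this way, which is why an upper bound on the gap estimated from samples is useful for judging a given policy.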
