Policy Mirror Descent Inherently Explores Action Space

Abstract

Explicit exploration in the action space was assumed to be indispensable for online policy gradient methods to avoid a drastic degradation in sample complexity, for solving general reinforcement learning problems over finite state and action spaces. In this paper, we establish for the first time an \(\tilde\{\mathcal\{O\}\}(1/\epsilon^2)\) sample complexity for online policy gradient methods without incorporating any exploration strategies. The essential development consists of two new on-policy evaluation operators and a novel analysis of the stochastic policy mirror descent method (SPMD). SPMD with the first evaluation operator, called value-based estimation, tailors to the Kullback-Leibler divergence. Provided the Markov chains on the state space of generated policies are uniformly mixing with non-diminishing minimal visitation measure, an \(\tilde\{\mathcal\{O\}\}(1/\epsilon^2)\) sample complexity is obtained with a linear dependence on the size of the action space. SPMD with the s

Policy Mirror Descent Inherently Explores Action Space

Abstract

Authors

Tags

Stats

Related papers