Abstract

Explicit exploration in the action space was assumed to be indispensable for online policy gradient methods to avoid a drastic degradation in sample complexity, for solving general reinforcement learning problems over finite state and action spaces. In this paper, we establish for the first time an \(\tilde\{\mathcal\{O\}\}(1/\epsilon^2)\) sample complexity for online policy gradient methods without incorporating any exploration strategies. The essential development consists of two new on-policy evaluation operators and a novel analysis of the stochastic policy mirror descent method (SPMD). SPMD with the first evaluation operator, called value-based estimation, tailors to the Kullback-Leibler divergence. Provided the Markov chains on the state space of generated policies are uniformly mixing with non-diminishing minimal visitation measure, an \(\tilde\{\mathcal\{O\}\}(1/\epsilon^2)\) sample complexity is obtained with a linear dependence on the size of the action space. SPMD with the s

Authors

(none)

Tags

  • Policy Gradient
  • Exploration

Stats

  • citations1
  • S2 citationsβ€”
  • github stars0
  • HF likes0
  • heat score2.26
  • arxiv keyli2023policy

Related papers