Policy Mirror Descent Inherently Explores Action Space
2023 · Yan Li, Guanghui Lan
Abstract
Explicit exploration in the action space was assumed to be indispensable for online policy gradient methods to avoid a drastic degradation in sample complexity, for solving general reinforcement learning problems over finite state and action spaces. In this paper, we establish for the first time an \(\tilde{\mathcal{O}}(1/\epsilon^2)\) sample complexity for online policy gradient methods without incorporating any exploration strategies. The essential development consists of two new on-policy evaluation operators and a novel analysis of the stochastic policy mirror descent method (SPMD). SPMD with the first evaluation operator, called value-based estimation, tailors to the Kullback-Leibler divergence. Provided the Markov chains on the state space of generated policies are uniformly mixing with non-diminishing minimal visitation measure, an \(\tilde{\mathcal{O}}(1/\epsilon^2)\) sample complexity is obtained with a linear dependence on the size of the action space. SPMD with the s
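To make the Kullback-Leibler setting mentioned in the abstract concrete, the following is a minimal sketch of a single mirror descent step with the KL divergence on a tabular policy. It is not the paper's SPMD method or either of its evaluation operators: the function name `pmd_kl_step`, the array layout, the fixed step size, and the convention that `q_values` holds costs to be minimized are all assumptions made for illustration, and the action-value estimates are simply taken as given.

```python
import numpy as np


def pmd_kl_step(policy, q_values, step_size):
    """One KL-divergence mirror descent step, applied per state (illustrative).

    policy:    array of shape (num_states, num_actions); each row is a
               probability distribution with strictly positive entries
               (assumption: full support, so the logarithm is finite).
    q_values:  array of the same shape holding estimated action values,
               treated here as costs (lower is better; an assumption).
    step_size: positive scalar.
    """
    # With the KL divergence, the step has a closed form:
    # new_policy(a|s) is proportional to policy(a|s) * exp(-step_size * q(s, a)).
    logits = np.log(policy) - step_size * q_values
    # Subtract the row-wise maximum for numerical stability before exponentiating.
    logits = logits - logits.max(axis=1, keepdims=True)
    unnormalized = np.exp(logits)
    return unnormalized / unnormalized.sum(axis=1, keepdims=True)


# Hypothetical usage: two states, three actions, uniform starting policy.
policy = np.full((2, 3), 1.0 / 3.0)
q_values = np.array([[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]])
updated = pmd_kl_step(policy, q_values, step_size=1.0)
```

In this sketch the update is multiplicative, so for finite estimates an action that starts with positive probability keeps positive probability after the step, even though no explicit exploration term is added.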
Related papers
- Policy Mirror Descent With Temporal Difference Learning: Sample Complexity Under Online Markov Data (2025)
- Policy Optimization With Stochastic Mirror Descent (2019)
- Mirror Descent Policy Optimisation For Robust Constrained Markov Decision Processes (2025)
- Policy Gradient For Robust Markov Decision Processes (2024)
- Learning Mirror Maps In Policy Mirror Descent (2024)
- Optimal Convergence Rate For Exact Policy Mirror Descent In Discounted Markov Decision Processes (2023)
- Low-switching Policy Gradient With Exploration Via Online Sensitivity Sampling (2023)
- Policy Optimization Over General State And Action Spaces (2022)