Primal-dual $\pi$ Learning: Sample Complexity And Sublinear Run Time For Ergodic Markov Decision Problems

Abstract

Consider the problem of approximating the optimal policy of a Markov decision process (MDP) by sampling state transitions. In contrast to existing reinforcement learning methods that are based on successive approximations to the nonlinear Bellman equation, we propose a Primal-Dual $\pi$ Learning method in light of the linear duality between the value and policy. The $\pi$ learning method is model-free and makes primal-dual updates to the policy and value vectors as new data are revealed. For infinite-horizon undiscounted Markov decision process with finite state space $S$ and finite action space $A$, the $\pi$ learning method finds an $\epsilon$-optimal policy using the following number of sample transitions $$ \tilde\{O\}( \frac\{(\tau\cdot t^*_\{mix\})^2 |S| |A| \}\{\epsilon^2\} ),$$ where $t^*_\{mix\}$ is an upper bound of mixing times across all policies and $\tau$ is a parameter characterizing the range of stationary distributions across policies. The $\pi\( le

Primal-dual \(\pi\) Learning: Sample Complexity And Sublinear Run Time For Ergodic Markov Decision Problems

Abstract

Authors

Tags

Stats

Related papers