Abstract

Consider the problem of approximating the optimal policy of a Markov decision process (MDP) by sampling state transitions. In contrast to existing reinforcement learning methods that are based on successive approximations to the nonlinear Bellman equation, we propose a Primal-Dual \(\pi\) Learning method in light of the linear duality between the value and policy. The \(\pi\) learning method is model-free and makes primal-dual updates to the policy and value vectors as new data are revealed. For infinite-horizon undiscounted Markov decision process with finite state space \(S\) and finite action space \(A\), the \(\pi\) learning method finds an \(\epsilon\)-optimal policy using the following number of sample transitions $\( \tilde\{O\}( \frac\{(\tau\cdot t^*_\{mix\})^2 |S| |A| \}\{\epsilon^2\} ),\)\( where \)t^*_\{mix\}\( is an upper bound of mixing times across all policies and \)\tau\( is a parameter characterizing the range of stationary distributions across policies. The \)\pi\( le

Authors

(none)

Tags

  • Uncategorized

Stats

  • citations0
  • S2 citationsβ€”
  • github stars0
  • HF likes0
  • heat score0.00
  • arxiv keywang2017primal

Related papers