Abstract

Policy optimization methods are one of the most widely used classes of Reinforcement Learning (RL) algorithms. However, theoretical understanding of these methods remains insufficient. Even in the episodic (time-inhomogeneous) tabular setting, the state-of-the-art theoretical result of policy-based method in \citet\{shani2020optimistic\} is only \(\tilde\{O\}(\sqrt\{S^2AH^4K\})\) where \(S\) is the number of states, \(A\) is the number of actions, \(H\) is the horizon, and \(K\) is the number of episodes, and there is a \(\sqrt\{SH\}\) gap compared with the information theoretic lower bound \(\tilde\{Ξ©\}(\sqrt\{SAH^3K\})\). To bridge such a gap, we propose a novel algorithm Reference-based Policy Optimization with Stable at Any Time guarantee (\algnameacro), which features the property "Stable at Any Time". We prove that our algorithm achieves \(\tilde\{O\}(\sqrt\{SAH^3K\} + \sqrt\{AH^4K\})\) regret. When \(S > H\), our algorithm is minimax optimal when ignoring logarithmic factors. To

Authors

(none)

Tags

  • Uncategorized

Stats

  • citations0
  • S2 citationsβ€”
  • github stars0
  • HF likes0
  • heat score0.00
  • arxiv keywu2021nearly

Related papers