On The Convergence Rate Of Off-policy Policy Optimization Methods With Density-ratio Correction

2021


Abstract

In this paper, we study the convergence properties of off-policy policy improvement algorithms with state-action density ratio correction in the function approximation setting, where the objective function is formulated as a max-max-min optimization problem. We characterize the bias of the learning objective and present two strategies with finite-time convergence guarantees. In our first strategy, we present the algorithm P-SREDA with convergence rate \(O(\epsilon^{-3})\), whose dependence on \(\epsilon\) is optimal. In our second strategy, we propose a new off-policy actor-critic style algorithm named O-SPIM. We prove that O-SPIM converges to a stationary point with total complexity \(O(\epsilon^{-4})\), which matches the convergence rate of some recent actor-critic algorithms in the on-policy setting.
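
As a rough illustration of what state-action density ratio correction means, the sketch below checks the standard identity that the expected reward under the target policy's occupancy \(d^\pi\) equals the expected reward under the behavior policy's occupancy \(d^\mu\) reweighted by \(w = d^\pi / d^\mu\). This is not the paper's max-max-min objective, nor P-SREDA or O-SPIM; the toy MDP, the policies, and every name in the code are assumptions made up for this example, and the ratio is computed exactly rather than learned with function approximation.

```python
# Toy 2-state, 2-action MDP. All numbers are hypothetical, chosen for illustration.
GAMMA = 0.9
STATES, ACTIONS = range(2), range(2)
P = {  # P[s][a] = distribution over next states
    0: {0: [0.8, 0.2], 1: [0.1, 0.9]},
    1: {0: [0.5, 0.5], 1: [0.3, 0.7]},
}
R = {0: {0: 1.0, 1: 0.0}, 1: {0: 0.5, 1: 2.0}}  # R[s][a] = reward
RHO0 = [1.0, 0.0]  # initial state distribution


def occupancy(policy, iters=2000):
    """Normalized discounted state-action occupancy d(s, a) for a policy.

    Iterates d(s') = (1 - gamma) * rho0(s') + gamma * sum_{s,a} d(s) pi(a|s) P(s'|s,a),
    which is a contraction with factor gamma, then returns d(s) * pi(a|s).
    """
    d = [0.5, 0.5]
    for _ in range(iters):
        d = [
            (1 - GAMMA) * RHO0[s2]
            + GAMMA * sum(d[s] * policy[s][a] * P[s][a][s2]
                          for s in STATES for a in ACTIONS)
            for s2 in STATES
        ]
    return {(s, a): d[s] * policy[s][a] for s in STATES for a in ACTIONS}


def expected_reward(d, weights=None):
    """Expected reward under occupancy d, optionally reweighted per (s, a)."""
    return sum(
        d[(s, a)] * (weights[(s, a)] if weights else 1.0) * R[s][a]
        for (s, a) in d
    )


target = [[0.2, 0.8], [0.6, 0.4]]    # pi(a|s), the policy being evaluated
behavior = [[0.5, 0.5], [0.5, 0.5]]  # mu(a|s), the policy that generated the data

d_pi = occupancy(target)
d_mu = occupancy(behavior)

# Exact density ratio; in practice this quantity has to be estimated.
w = {sa: d_pi[sa] / d_mu[sa] for sa in d_mu}

on_policy = expected_reward(d_pi)       # E_{d_pi}[r]
off_policy = expected_reward(d_mu, w)   # E_{d_mu}[w * r]
print(on_policy, off_policy)
```

The two printed values agree up to floating-point error. The identity requires \(d^\mu(s,a) > 0\) wherever \(d^\pi(s,a) > 0\), which holds here because the behavior policy is uniform and both states are reachable.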
