Abstract

This work studies the statistical limits of uniform convergence for offline policy evaluation (OPE) problems with model-based methods (for episodic MDPs) and provides a unified framework towards optimal learning for several well-motivated offline tasks. Uniform OPE \(\sup_\Pi|Q^\pi-\hat{Q}^\pi|<\epsilon\) is a stronger measure than point-wise OPE and ensures offline learning when \(\Pi\) contains all policies (the global class). In this paper, we establish an \(\Omega(H^2 S/d_m\epsilon^2)\) lower bound (over the model-based family) for the global uniform OPE, and our main result establishes an upper bound of \(\tilde{O}(H^2/d_m\epsilon^2)\) for the *local* uniform convergence that applies to all *near-empirically optimal* policies for MDPs with *stationary* transitions. Here \(d_m\) is the minimal marginal state-action probability. Critically, the highlight in achieving the optimal rate \(\tilde{O}(H^2/d_m\epsilon^2)\) is our design of the *singleton absorbing MDP*, which is a new sharp

Authors

(none)

Tags

  • Model-Based RL
  • Offline RL

Stats

  • citations: 0
  • S2 citations: n/a
  • github stars: 0
  • HF likes: 0
  • heat score: 0.00
  • arxiv key: yin2021optimal

Related papers