Optimal Uniform OPE And Model-based Offline Reinforcement Learning In Time-homogeneous, Reward-free And Task-agnostic Settings

Abstract

This work studies the statistical limits of uniform convergence for offline policy evaluation (OPE) with model-based methods in episodic MDPs, and provides a unified framework for optimal learning in several well-motivated offline tasks. Uniform OPE, \(\sup_{\pi\in\Pi}|Q^\pi-\hat{Q}^\pi|<\epsilon\), is a stronger guarantee than point-wise OPE and ensures offline learning when \(\Pi\) contains all policies (the global class). We establish an \(\Omega(H^2 S/(d_m\epsilon^2))\) lower bound (over the model-based family) for global uniform OPE. Our main result is an upper bound of \(\tilde{O}(H^2/(d_m\epsilon^2))\) for *local* uniform convergence, which applies to all *near-empirically optimal* policies in MDPs with *stationary* transitions. Here \(d_m\) is the minimal marginal state-action probability. Critically, the key to achieving the optimal rate \(\tilde{O}(H^2/(d_m\epsilon^2))\) is our design of the *singleton absorbing MDP*, which is a new sharp
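To make the uniform OPE quantity concrete, here is a minimal sketch of a model-based (plug-in) evaluation on a toy tabular MDP with stationary transitions. It is not the paper's algorithm or analysis. Everything in it is an illustrative assumption: the two-state MDP, the known rewards in \([0,1]\), the sampling scheme (a fixed number `n` of next-state draws per state-action pair instead of trajectories from a behavior policy), and the restriction of \(\Pi\) to deterministic stationary policies so that the supremum can be enumerated. The function names are hypothetical.

```python
import itertools
import random


def q_values(P, r, policy, H):
    """Backward induction: Q[h][s][a] for a deterministic stationary policy
    under time-homogeneous transitions P[s][a][s_next] and rewards r[s][a]."""
    S, A = len(P), len(P[0])
    V_next = [0.0] * S  # value after the last step is zero
    Q = [None] * H
    for h in reversed(range(H)):
        Q[h] = [[r[s][a] + sum(P[s][a][t] * V_next[t] for t in range(S))
                 for a in range(A)] for s in range(S)]
        V_next = [Q[h][s][policy[s]] for s in range(S)]
    return Q


def empirical_model(P, n, rng):
    """Plug-in transition estimate from n i.i.d. next-state draws per (s, a).
    Assumption: generative sampling, not offline trajectories."""
    S, A = len(P), len(P[0])
    P_hat = [[[0.0] * S for _ in range(A)] for _ in range(S)]
    for s in range(S):
        for a in range(A):
            for t in rng.choices(range(S), weights=P[s][a], k=n):
                P_hat[s][a][t] += 1.0 / n
    return P_hat


def uniform_ope_error(P, P_hat, r, H):
    """Largest entry-wise gap |Q^pi - Q_hat^pi| over all deterministic
    stationary policies (a small stand-in for the policy class Pi)."""
    S, A = len(P), len(P[0])
    worst = 0.0
    for policy in itertools.product(range(A), repeat=S):
        Q = q_values(P, r, policy, H)
        Q_hat = q_values(P_hat, r, policy, H)
        err = max(abs(Q[h][s][a] - Q_hat[h][s][a])
                  for h in range(H) for s in range(S) for a in range(A))
        worst = max(worst, err)
    return worst


# Toy instance (all numbers are made up for illustration).
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.1, 0.9]]]
r = [[0.0, 1.0],
     [0.5, 0.0]]
H = 5

rng = random.Random(0)
for n in (50, 500, 5000):
    P_hat = empirical_model(P, n, rng)
    print(n, uniform_ope_error(P, P_hat, r, H))
```

In this sketch `n` loosely plays the role of the effective sample size per state-action pair, so increasing it should tend to shrink the printed error, although any single run is random.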
