Optimal Uniform OPE And Model-based Offline Reinforcement Learning In Time-homogeneous, Reward-free And Task-agnostic Settings
2021 · Ming Yin, Yu-Xiang Wang
Abstract
This work studies the statistical limits of uniform convergence for offline policy evaluation (OPE) problems with model-based methods (for episodic MDP) and provides a unified framework towards optimal learning for several well-motivated offline tasks. Uniform OPE \(\sup_\Pi|Q^\pi-\hat{Q}^\pi|<\epsilon\) is a stronger measure than the point-wise OPE and ensures offline learning when \(\Pi\) contains all policies (the global class). In this paper, we establish an \(\Omega(H^2 S/d_m\epsilon^2)\) lower bound (over model-based family) for the global uniform OPE and our main result establishes an upper bound of \(\tilde{O}(H^2/d_m\epsilon^2)\) for the *local* uniform convergence that applies to all *near-empirically optimal* policies for the MDPs with *stationary* transition. Here \(d_m\) is the minimal marginal state-action probability. Critically, the highlight in achieving the optimal rate \(\tilde{O}(H^2/d_m\epsilon^2)\) is our design of *singleton absorbing MDP*, which is a new sharp
Related papers
- Near-optimal Provable Uniform Convergence In Offline Policy Evaluation For Reinforcement Learning (2020)
- Offline Policy Evaluation For Reinforcement Learning With Adaptively Collected Data (2023)
- Off-policy Evaluation In Doubly Inhomogeneous Environments (2023)
- Near-optimal Offline Reinforcement Learning Via Double Variance Reduction (2021)
- Nearly Horizon-free Offline Reinforcement Learning (2021)
- Policy Finetuning: Bridging Sample-efficient Offline And Online Reinforcement Learning (2021)
- Off-policy Evaluation In Infinite-horizon Reinforcement Learning With Latent Confounders (2020)
- Offline Stochastic Shortest Path: Learning, Evaluation And Towards Optimality (2022)