Optimal Single-policy Sample Complexity And Transient Coverage For Average-reward Offline RL
2025 Β· Matthew Zurek, Guy Zamir, Yudong Chen
Abstract
We study offline reinforcement learning in average-reward MDPs, which presents increased challenges from the perspectives of distribution shift and non-uniform coverage, and has been relatively underexamined from a theoretical perspective. While previous work obtains performance guarantees under single-policy data coverage assumptions, such guarantees utilize additional complexity measures which are uniform over all policies, such as the uniform mixing time. We develop sharp guarantees depending only on the target policy, specifically the bias span and a novel policy hitting radius, yielding the first fully single-policy sample complexity bound for average-reward offline RL. We are also the first to handle general weakly communicating MDPs, contrasting restrictive structural assumptions made in prior work. To achieve this, we introduce an algorithm based on pessimistic discounted value iteration enhanced by a novel quantile clipping technique, which enables the use of a sharper empiric
Authors
(none)
Tags
Stats
Related papers
- Sample Efficient Active Algorithms For Offline Reinforcement Learning (2026)0.00
- Offline Reinforcement Learning With Realizability And Single-policy Concentrability (2022)0.00
- Bridging Distributionally Robust Learning And Offline RL: An Approach To Mitigate Distribution Shift And Partial Data Coverage (2023)0.00
- On The Sample Complexity Of Vanilla Model-based Offline Reinforcement Learning With Dependent Samples (2023)2.26
- Offline Policy Evaluation For Reinforcement Learning With Adaptively Collected Data (2023)0.00
- Sample Complexity Of Offline Distributionally Robust Linear Markov Decision Processes (2024)0.00
- Pessimistic Q-learning For Offline Reinforcement Learning: Towards Optimal Sample Complexity (2022)0.00
- Near-optimal Offline Reinforcement Learning Via Double Variance Reduction (2021)0.00