Pessimistic Q-learning For Offline Reinforcement Learning: Towards Optimal Sample Complexity
2022 Β· Laixi Shi, Gen Li, Yuting Wei, et al.
Abstract
Offline or batch reinforcement learning seeks to learn a near-optimal policy using history data without active exploration of the environment. To counter the insufficient coverage and sample scarcity of many offline datasets, the principle of pessimism has been recently introduced to mitigate high bias of the estimated values. While pessimistic variants of model-based algorithms (e.g., value iteration with lower confidence bounds) have been theoretically investigated, their model-free counterparts -- which do not require explicit model estimation -- have not been adequately studied, especially in terms of sample efficiency. To address this inadequacy, we study a pessimistic variant of Q-learning in the context of finite-horizon Markov decision processes, and characterize its sample complexity under the single-policy concentrability assumption which does not require the full coverage of the state-action space. In addition, a variance-reduced pessimistic Q-learning algorithm is proposed
Authors
(none)
Tags
Stats
Related papers
- Is Pessimism Provably Efficient For Offline RL? (2020)0.00
- Distributionally Robust Model-based Offline Reinforcement Learning With Near-optimal Sample Complexity (2022)0.00
- Near-optimal Offline Reinforcement Learning With Linear Representation: Leveraging Variance Information With Pessimism (2022)0.00
- Pessimism In The Face Of Confounders: Provably Efficient Offline Reinforcement Learning In Partially Observable Markov Decision Processes (2022)0.00
- State-aware Proximal Pessimistic Algorithms For Offline Reinforcement Learning (2022)0.00
- POPO: Pessimistic Offline Policy Optimization (2020)5.24
- Mildly Conservative Q-learning For Offline Reinforcement Learning (2022)0.00
- Double Pessimism Is Provably Efficient For Distributionally Robust Offline Reinforcement Learning: Generic Algorithm And Robust Partial Coverage (2023)0.00