Is Pessimism Provably Efficient For Offline RL?
2020 Β· Ying Jin, Zhuoran Yang, Zhaoran Wang
Abstract
We study offline reinforcement learning (RL), which aims to learn an optimal policy based on a dataset collected a priori. Due to the lack of further interactions with the environment, offline RL suffers from the insufficient coverage of the dataset, which eludes most existing theoretical analysis. In this paper, we propose a pessimistic variant of the value iteration algorithm (PEVI), which incorporates an uncertainty quantifier as the penalty function. Such a penalty function simply flips the sign of the bonus function for promoting exploration in online RL, which makes it easily implementable and compatible with general function approximators. Without assuming the sufficient coverage of the dataset, we establish a data-dependent upper bound on the suboptimality of PEVI for general Markov decision processes (MDPs). When specialized to linear MDPs, it matches the information-theoretic lower bound up to multiplicative factors of the dimension and horizon. In other words, pessimism is
Authors
(none)
Tags
Stats
Related papers
- Bellman-consistent Pessimism For Offline Reinforcement Learning (2021)0.00
- State-aware Proximal Pessimistic Algorithms For Offline Reinforcement Learning (2022)0.00
- Double Pessimism Is Provably Efficient For Distributionally Robust Offline Reinforcement Learning: Generic Algorithm And Robust Partial Coverage (2023)0.00
- Pessimism In The Face Of Confounders: Provably Efficient Offline Reinforcement Learning In Partially Observable Markov Decision Processes (2022)0.00
- Near-optimal Offline Reinforcement Learning With Linear Representation: Leveraging Variance Information With Pessimism (2022)0.00
- POPO: Pessimistic Offline Policy Optimization (2020)5.24
- Pessimistic Nonlinear Least-squares Value Iteration For Offline Reinforcement Learning (2023)0.00
- Model-based Offline Reinforcement Learning With Pessimism-modulated Dynamics Belief (2022)0.00