Double Pessimism Is Provably Efficient For Distributionally Robust Offline Reinforcement Learning: Generic Algorithm And Robust Partial Coverage
2023 Β· Jose Blanchet, Miao Lu, Tong Zhang, et al.
Abstract
In this paper, we study distributionally robust offline reinforcement learning (robust offline RL), which seeks to find an optimal policy purely from an offline dataset that can perform well in perturbed environments. In specific, we propose a generic algorithm framework called Doubly Pessimistic Model-based Policy Optimization (\(P^2MPO\)), which features a novel combination of a flexible model estimation subroutine and a doubly pessimistic policy optimization step. Notably, the double pessimism principle is crucial to overcome the distributional shifts incurred by (i) the mismatch between the behavior policy and the target policies; and (ii) the perturbation of the nominal model. Under certain accuracy conditions on the model estimation subroutine, we prove that \(P^2MPO\) is sample-efficient with robust partial coverage data, which only requires the offline data to have good coverage of the distributions induced by the optimal robust policy and the perturbed models around the nomina
Authors
(none)
Tags
Stats
Related papers
- Pessimism In The Face Of Confounders: Provably Efficient Offline Reinforcement Learning In Partially Observable Markov Decision Processes (2022)0.00
- Bridging Distributionally Robust Learning And Offline RL: An Approach To Mitigate Distribution Shift And Partial Data Coverage (2023)0.00
- Is Pessimism Provably Efficient For Offline RL? (2020)0.00
- Distributionally Robust Model-based Offline Reinforcement Learning With Near-optimal Sample Complexity (2022)0.00
- Model-based Offline Reinforcement Learning With Pessimism-modulated Dynamics Belief (2022)0.00
- Sample Complexity Of Offline Distributionally Robust Linear Markov Decision Processes (2024)0.00
- State-aware Proximal Pessimistic Algorithms For Offline Reinforcement Learning (2022)0.00
- Distributionally Robust Offline Reinforcement Learning With Linear Function Approximation (2022)0.00