Double Pessimism Is Provably Efficient For Distributionally Robust Offline Reinforcement Learning: Generic Algorithm And Robust Partial Coverage

Abstract

In this paper, we study distributionally robust offline reinforcement learning (robust offline RL), which seeks to find an optimal policy purely from an offline dataset that can perform well in perturbed environments. In specific, we propose a generic algorithm framework called Doubly Pessimistic Model-based Policy Optimization (\(P^2MPO\)), which features a novel combination of a flexible model estimation subroutine and a doubly pessimistic policy optimization step. Notably, the double pessimism principle is crucial to overcome the distributional shifts incurred by (i) the mismatch between the behavior policy and the target policies; and (ii) the perturbation of the nominal model. Under certain accuracy conditions on the model estimation subroutine, we prove that \(P^2MPO\) is sample-efficient with robust partial coverage data, which only requires the offline data to have good coverage of the distributions induced by the optimal robust policy and the perturbed models around the nomina

Double Pessimism Is Provably Efficient For Distributionally Robust Offline Reinforcement Learning: Generic Algorithm And Robust Partial Coverage

Abstract

Authors

Tags

Stats

Related papers