Diverse Randomized Value Functions: A Provably Pessimistic Approach For Offline Reinforcement Learning

Abstract

Offline Reinforcement Learning (RL) faces distributional shift and unreliable value estimation, especially for out-of-distribution (OOD) actions. To address this, existing uncertainty-based methods penalize the value function using uncertainty quantification, but they require numerous ensemble networks, which poses computational challenges and leads to suboptimal outcomes. In this paper, we introduce a novel strategy that employs diverse randomized value functions to estimate the posterior distribution of \(Q\)-values. This strategy provides robust uncertainty quantification and estimates lower confidence bounds (LCB) of \(Q\)-values. By applying moderate value penalties to OOD actions, our method yields a provably pessimistic approach. We also emphasize diversity within the randomized value functions and improve efficiency by introducing a diversity regularization method, which reduces the required number of networks. These modules lead to reliable value estimation and efficient policy learning from offline data. Theoretic
