Abstract

Among the reasons hindering reinforcement learning (RL) applications to real-world problems, two factors are critical: limited data and the mismatch between the testing environment (real environment in which the policy is deployed) and the training environment (e.g., a simulator). This paper attempts to address these issues simultaneously with distributionally robust offline RL, where we learn a distributionally robust policy using historical data obtained from the source environment by optimizing against a worst-case perturbation thereof. In particular, we move beyond tabular settings and consider linear function approximation. More specifically, we consider two settings, one where the dataset is well-explored and the other where the dataset has sufficient coverage of the optimal policy. We propose two algorithms~-- one for each of the two settings~-- that achieve error bounds \(\tilde\{O\}(d^\{1/2\}/N^\{1/2\})\) and \(\tilde\{O\}(d^\{3/2\}/N^\{1/2\})\) respectively, where \(d\) is th

Authors

(none)

Tags

  • Offline RL

Stats

Related papers