Offline Policy Evaluation For Reinforcement Learning With Adaptively Collected Data
2023 · Sunil Madhow, Dan Qiao, Ming Yin, et al.
Abstract
Developing theoretical guarantees on the sample complexity of offline RL methods is an important step towards making data-hungry RL algorithms practically viable. Currently, most results hinge on unrealistic assumptions about the data distribution -- namely that it comprises a set of i.i.d. trajectories collected by a single logging policy. We consider a more general setting where the dataset may have been gathered adaptively. We develop theory for the TMIS Offline Policy Evaluation (OPE) estimator in this generalized setting for tabular MDPs, deriving high-probability, instance-dependent bounds on its estimation error. We also recover minimax-optimal offline learning in the adaptive setting. Finally, we conduct simulations to empirically analyze the behavior of these estimators under adaptive and non-adaptive regimes.
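The abstract names the TMIS estimator without defining it. As a rough illustration only, the sketch below assumes TMIS refers to the tabular marginalized importance sampling estimator: fit empirical transitions and mean rewards per time step from the logged data, roll the target policy's state distribution forward through the fitted model, and sum the expected rewards. The function name `tmis_estimate`, the trajectory format, and the choice to treat unvisited state-action pairs as contributing zero are assumptions made for this example, not details taken from the paper.

```python
import numpy as np


def tmis_estimate(trajectories, pi, S, A, H):
    """Plug-in value estimate for a finite-horizon tabular MDP (illustrative sketch).

    trajectories: list of length-H lists of (s, a, r, s_next) tuples.
    pi: array of shape (H, S, A) holding the target policy's action probabilities.
    """
    n_sas = np.zeros((H, S, A, S))  # transition counts per time step
    r_sum = np.zeros((H, S, A))     # summed rewards per time step
    d = np.zeros(S)                 # empirical initial-state distribution

    for traj in trajectories:
        d[traj[0][0]] += 1.0
        for t, (s, a, r, s_next) in enumerate(traj):
            n_sas[t, s, a, s_next] += 1.0
            r_sum[t, s, a] += r
    d /= len(trajectories)

    n_sa = n_sas.sum(axis=3)
    # Assumption: unvisited (t, s, a) triples get zero transitions and zero reward.
    safe_n = np.where(n_sa > 0, n_sa, 1.0)
    P_hat = n_sas / safe_n[..., None]
    r_hat = r_sum / safe_n

    value = 0.0
    for t in range(H):
        sa = d[:, None] * pi[t]                   # estimated state-action marginal at step t
        value += float((sa * r_hat[t]).sum())     # expected reward at step t
        d = np.einsum("sa,sax->x", sa, P_hat[t])  # push the marginal forward one step
    return value


# Toy usage: two states, two actions, horizon 2, target policy always picks action 0.
logged = [
    [(0, 0, 1.0, 1), (1, 0, 2.0, 0)],
    [(0, 1, 0.0, 0), (0, 1, 0.0, 0)],
]
target = np.zeros((2, 2, 2))
target[:, :, 0] = 1.0
estimate = tmis_estimate(logged, target, S=2, A=2, H=2)
```

This code only pools counts and never asks which logging policy produced a given trajectory, so it runs unchanged on adaptively collected data. Whether the resulting estimate remains accurate in that regime is the question the abstract says the paper analyzes.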
Related papers
- Near-optimal Provable Uniform Convergence In Offline Policy Evaluation For Reinforcement Learning (2020)
- Projected State-action Balancing Weights For Offline Reinforcement Learning (2021)
- Towards Data-driven Offline Simulations For Online Reinforcement Learning (2022)
- Counterfactual-augmented Importance Sampling For Semi-offline Policy Evaluation (2023)
- Expert-supervised Reinforcement Learning For Offline Policy Learning And Evaluation (2020)
- Optimal Uniform OPE And Model-based Offline Reinforcement Learning In Time-homogeneous, Reward-free And Task-agnostic Settings (2021)
- Near-optimal Offline Reinforcement Learning Via Double Variance Reduction (2021)
- Towards Optimal Off-policy Evaluation For Reinforcement Learning With Marginalized Importance Sampling (2019)