Robust Batch Policy Learning In Markov Decision Processes
2020 · Zhengling Qi, Peng Liao
Abstract
We study the offline data-driven sequential decision-making problem in the framework of Markov decision processes (MDPs). In order to enhance the generalizability and adaptivity of the learned policy, we propose to evaluate each policy by a set of average rewards with respect to distributions centered at the policy-induced stationary distribution. Given a pre-collected dataset of multiple trajectories generated by some behavior policy, our goal is to learn a robust policy in a pre-specified policy class that can maximize the smallest value of this set. Leveraging the theory of semi-parametric statistics, we develop a statistically efficient policy learning method for estimating the defined robust optimal policy. A rate-optimal regret bound up to a logarithmic factor is established in terms of total decision points in the dataset.
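The max-min criterion described in the abstract can be illustrated with a small tabular sketch. This is not the paper's estimator: it assumes the transition matrix and rewards are known (the paper works from offline trajectories), and it assumes one particular uncertainty set, namely all distributions q whose density ratio to the stationary distribution d is at most 1/alpha. The function names, the parameter alpha, and the toy numbers are all hypothetical.

```python
import numpy as np

def stationary_distribution(P):
    """Stationary distribution d of an ergodic chain with row-stochastic P."""
    n = P.shape[0]
    # Solve d^T (P - I) = 0 together with sum(d) = 1 by least squares.
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    d, *_ = np.linalg.lstsq(A, b, rcond=None)
    return d

def worst_case_average_reward(d, r, alpha):
    """Minimize sum(q * r) over q with q <= d / alpha and sum(q) = 1.

    ASSUMPTION: this density-ratio set is an illustrative choice only.
    The minimizer greedily puts the allowed mass on the lowest rewards.
    """
    order = np.argsort(r)
    remaining, value = 1.0, 0.0
    for cap, reward in zip(d[order] / alpha, r[order]):
        mass = min(cap, remaining)
        value += mass * reward
        remaining -= mass
        if remaining <= 0.0:
            break
    return value

def robust_policy(chains, rewards, alpha):
    """Pick the index of the policy with the largest worst-case value.

    chains[k], rewards[k]: state transition matrix and per-state reward
    induced by policy k (hypothetical toy inputs).
    """
    values = [
        worst_case_average_reward(stationary_distribution(P), r, alpha)
        for P, r in zip(chains, rewards)
    ]
    return int(np.argmax(values)), values
```

With alpha = 1 the set contains only d itself and the criterion reduces to the ordinary long-run average reward; smaller alpha enlarges the set and makes the evaluation more pessimistic.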
Related papers
- Batch Policy Learning In Average Reward Markov Decision Processes (2020)
- Robust Anytime Learning Of Markov Decision Processes (2022)
- Policy Learning For Robust Markov Decision Process With A Mismatched Generative Model (2022)
- Solving Robust MDPs Through No-regret Dynamics (2023)
- An Offline Risk-aware Policy Selection Method For Bayesian Markov Decision Processes (2021)
- Efficient Policy Learning For Non-stationary MDPs Under Adversarial Manipulation (2019)
- Sample Complexity Of Offline Distributionally Robust Linear Markov Decision Processes (2024)
- Dynamic Regret Of Online Markov Decision Processes (2022)