A Maximum-entropy Approach To Off-policy Evaluation In Average-reward MDPs
2020 · Nevena Lazic, Dong Yin, Mehrdad Farajtabar, et al.
Abstract
This work focuses on off-policy evaluation (OPE) with function approximation in infinite-horizon undiscounted Markov decision processes (MDPs). For MDPs that are ergodic and linear (i.e. where rewards and dynamics are linear in some known features), we provide the first finite-sample OPE error bound, extending existing results beyond the episodic and discounted cases. In a more general setting, when the feature dynamics are approximately linear and for arbitrary rewards, we propose a new approach for estimating stationary distributions with function approximation. We formulate this problem as finding the maximum-entropy distribution subject to matching feature expectations under empirical dynamics. We show that this results in an exponential-family distribution whose sufficient statistics are the features, paralleling maximum-entropy approaches in supervised learning. We demonstrate the effectiveness of the proposed OPE approaches in multiple environments.
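The maximum-entropy formulation in the abstract can be illustrated on a toy problem. The sketch below is not the paper's algorithm; it assumes a small finite chain whose transition matrix `P` (under the target policy) is known exactly, a hypothetical feature matrix `Phi`, and a hypothetical reward vector `r`. It finds the maximum-entropy distribution `d` whose feature expectations are unchanged by one step of the dynamics, i.e. `d @ Phi == d @ P @ Phi`, by minimizing the convex dual with a damped Newton method. The solution has the exponential-family form `d[i] ∝ exp(theta @ psi[i])` with `psi = Phi - P @ Phi`; when the feature dynamics are linear, `psi[i]` is itself a linear function of the features, which is consistent with the features being the sufficient statistics.

```python
import numpy as np

def maxent_stationary(Phi, P, iters=50, tol=1e-9):
    """Max-entropy d on a finite chain with d @ Phi == d @ P @ Phi (toy sketch)."""
    psi = Phi - P @ Phi                    # constraint features: E_d[psi] = 0
    theta = np.zeros(Phi.shape[1])

    def dist(th):                          # exponential-family distribution
        z = psi @ th
        w = np.exp(z - z.max())            # shift for numerical stability
        return w / w.sum()

    def dual(th):                          # log-partition function (convex)
        z = psi @ th
        return z.max() + np.log(np.exp(z - z.max()).sum())

    for _ in range(iters):
        d = dist(theta)
        g = psi.T @ d                      # gradient of dual = E_d[psi]
        if np.linalg.norm(g) < tol:
            break
        # Hessian of the dual = covariance of psi under d
        H = psi.T @ (psi * d[:, None]) - np.outer(g, g)
        step = np.linalg.solve(H + 1e-12 * np.eye(len(theta)), g)
        t, f0 = 1.0, dual(theta)
        # Backtracking line search (Armijo condition)
        while dual(theta - t * step) > f0 - 0.25 * t * (g @ step) and t > 1e-8:
            t *= 0.5
        theta = theta - t * step
    return dist(theta), theta

# Hypothetical toy instance: 5 states, 2 features, strictly positive transitions.
rng = np.random.default_rng(0)
P = rng.random((5, 5)) + 0.1
P /= P.sum(axis=1, keepdims=True)          # make rows sum to one
Phi = rng.normal(size=(5, 2))
r = rng.random(5)                          # arbitrary rewards

d, theta = maxent_stationary(Phi, P)
avg_reward_estimate = d @ r                # plug-in estimate of average reward
```

Because only feature expectations are constrained, `d` generally differs from the chain's true stationary distribution; the two coincide only when the features are rich enough to pin the distribution down. In an off-policy setting `P` would be replaced by an estimate built from logged data, which this sketch does not attempt.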
Related papers
- A Minimax Learning Approach To Off-policy Evaluation In Confounded Partially Observable Markov Decision Processes (2021)
- Minimax-optimal Off-policy Evaluation With Linear Function Approximation (2020)
- Variance-aware Off-policy Evaluation With Linear Function Approximation (2021)
- Double Reinforcement Learning For Efficient Off-policy Evaluation In Markov Decision Processes (2019)
- Future-dependent Value-based Off-policy Evaluation In POMDPs (2022)
- A Spectral Approach To Off-policy Evaluation For POMDPs (2021)
- Towards Optimal Off-policy Evaluation For Reinforcement Learning With Marginalized Importance Sampling (2019)
- Provably Efficient Maximum Entropy Exploration (2018)