Towards Optimal Off-policy Evaluation For Reinforcement Learning With Marginalized Importance Sampling
2019 · Tengyang Xie, Yifei Ma, Yu-Xiang Wang
Abstract
Motivated by the many real-world applications of reinforcement learning (RL) that require safe-policy iterations, we consider the problem of off-policy evaluation (OPE) -- the problem of evaluating a new policy using the historical data obtained by different behavior policies -- under the model of nonstationary episodic Markov Decision Processes (MDP) with a long horizon and a large action space. Existing importance sampling (IS) methods often suffer from large variance that depends exponentially on the RL horizon \(H\). To solve this problem, we consider a marginalized importance sampling (MIS) estimator that recursively estimates the state marginal distribution for the target policy at every step. MIS achieves a mean-squared error of \( \frac{1}{n} \sum\nolimits_{t=1}^H \mathbb{E}_{\mu}\left[\frac{d_t^\pi(s_t)^2}{d_t^\mu(s_t)^2} \mathrm{Var}_{\mu}\left[\frac{\pi_t(a_t|s_t)}{\mu_t(a_t|s_t)}\big( V_{t+1}^\pi(s_{t+1}) + r_t\big) \middle| s_t\right]\right] + \ti
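The recursive idea in the abstract can be illustrated with a minimal tabular sketch. It is not the authors' implementation. The abstract only says that the state marginal of the target policy is estimated recursively at every step, so everything else here is an assumption made for illustration: the function name `mis_estimate`, the episode layout `(s, a, r, s_next)`, the policy signature `policy(t, a, s)`, and the plug-in form of the per-state reward and transition estimates.

```python
from collections import defaultdict


def mis_estimate(episodes, pi, mu, horizon):
    """Tabular marginalized importance sampling sketch (hypothetical helper).

    episodes: list of trajectories, each a list of `horizon` tuples
              (s, a, r, s_next) collected under the behavior policy mu.
    pi, mu:   callables (t, a, s) -> probability of action a in state s at step t.
    """
    n = len(episodes)

    # Step-0 state marginal: the initial distribution does not depend on the
    # policy, so the empirical distribution of the data is used directly.
    d = defaultdict(float)
    for ep in episodes:
        d[ep[0][0]] += 1.0 / n

    value = 0.0
    for t in range(horizon):
        count = defaultdict(int)             # visits to s at step t
        reward_sum = defaultdict(float)      # sum of (pi/mu) * r over visits to s
        trans_sum = defaultdict(lambda: defaultdict(float))  # sum of (pi/mu) per (s, s_next)

        for ep in episodes:
            s, a, r, s_next = ep[t]
            w = pi(t, a, s) / mu(t, a, s)    # single-step importance weight only
            count[s] += 1
            reward_sum[s] += w * r
            trans_sum[s][s_next] += w

        # Accumulate the value and push the marginal one step forward under pi.
        d_next = defaultdict(float)
        for s, mass in d.items():
            if count[s] == 0:
                continue                     # state never observed at step t
            value += mass * reward_sum[s] / count[s]
            for s_next, w_sum in trans_sum[s].items():
                d_next[s_next] += mass * w_sum / count[s]
        d = d_next

    return value


# Toy data: one state, two actions, reward equals the action taken.
episodes = [
    [(0, a0, float(a0), 0), (0, a1, float(a1), 0)]
    for a0 in (0, 1)
    for a1 in (0, 1)
]


def mu(t, a, s):
    return 0.5                               # uniform behavior policy


def pi(t, a, s):
    return 1.0 if a == 1 else 0.0            # target policy always picks action 1


estimate = mis_estimate(episodes, pi, mu, horizon=2)
print(estimate)
```

Only one-step ratios \(\pi_t(a_t|s_t)/\mu_t(a_t|s_t)\) appear in the sketch. This is the design choice that avoids the product of \(H\) ratios used by trajectory-level importance sampling.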
Related papers
- Scaling Marginalized Importance Sampling To High-dimensional State-spaces Via State Abstraction (2022)
- Double Reinforcement Learning For Efficient Off-policy Evaluation In Markov Decision Processes (2019)
- Intrinsically Efficient, Stable, And Bounded Off-policy Evaluation For Reinforcement Learning (2019)
- Asymptotically Efficient Off-policy Evaluation For Tabular Reinforcement Learning (2020)
- Counterfactual-augmented Importance Sampling For Semi-offline Policy Evaluation (2023)
- More Efficient Off-policy Evaluation Through Regularized Targeted Learning (2019)
- Kernel Metric Learning For In-sample Off-policy Evaluation Of Deterministic RL Policies (2024)
- Offline Policy Evaluation For Reinforcement Learning With Adaptively Collected Data (2023)