Towards Optimal Off-policy Evaluation For Reinforcement Learning With Marginalized Importance Sampling

Abstract

Motivated by the many real-world applications of reinforcement learning (RL) that require safe-policy iterations, we consider the problem of off-policy evaluation (OPE) -- the problem of evaluating a new policy using the historical data obtained by different behavior policies -- under the model of nonstationary episodic Markov Decision Processes (MDP) with a long horizon and a large action space. Existing importance sampling (IS) methods often suffer from large variance that depends exponentially on the RL horizon $H$. To solve this problem, we consider a marginalized importance sampling (MIS) estimator that recursively estimates the state marginal distribution for the target policy at every step. MIS achieves a mean-squared error of $\frac{1}{n} \sum\nolimits_{t=1}^H\mathbb{E}_{\mu}\left[\frac{d_t^\pi(s_t)^2}{d_t^\mu(s_t)^2} \mathrm{Var}_{\mu}\left[\frac{\pi_t(a_t|s_t)}{\mu_t(a_t|s_t)}\big( V_{t+1}^\pi(s_{t+1}) + r_t\big) \middle| s_t\right]\right] + \ti
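To make the recursion described in the abstract concrete, the following is a minimal sketch of a tabular MIS-style estimator. It assumes finite state and action spaces, known per-step policies, and logged episodes stored as integer arrays. The function name `mis_estimate`, the array layout, and the zero-count handling are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def mis_estimate(states, actions, rewards, pi, mu):
    """Tabular marginalized importance sampling sketch (hypothetical layout).

    states:  int array, shape (n, H + 1), logged states s_1..s_{H+1}
    actions: int array, shape (n, H), logged actions
    rewards: float array, shape (n, H), logged rewards
    pi, mu:  float arrays, shape (H, S, A), target and behavior policies
    """
    n, H = actions.shape
    S = pi.shape[1]

    # Empirical initial-state distribution; it is the same under pi and mu.
    d = np.bincount(states[:, 0], minlength=S) / n
    value = 0.0

    for t in range(H):
        s, a, s_next = states[:, t], actions[:, t], states[:, t + 1]

        # One-step importance ratio pi_t(a|s) / mu_t(a|s) for each episode.
        rho = pi[t, s, a] / mu[t, s, a]

        # Visit counts per state; unvisited states are clamped to 1 so their
        # estimates stay at zero instead of dividing by zero (an assumption).
        counts = np.maximum(np.bincount(s, minlength=S), 1)

        # Importance-weighted estimate of the expected reward under pi.
        r_hat = np.bincount(s, weights=rho * rewards[:, t], minlength=S) / counts

        # Importance-weighted estimate of the state transition under pi.
        P_hat = np.zeros((S, S))
        np.add.at(P_hat, (s, s_next), rho)
        P_hat /= counts[:, None]

        # Accumulate this step's reward, then push the marginal forward.
        value += d @ r_hat
        d = d @ P_hat

    return value
```

Only a single-step ratio enters each estimate, so no product of ratios over the horizon is formed. When `pi` equals `mu`, every ratio is 1 and the sketch reduces to the average logged return.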
