Abstract

Motivated by the many real-world applications of reinforcement learning (RL) that require safe-policy iterations, we consider the problem of off-policy evaluation (OPE) -- the problem of evaluating a new policy using the historical data obtained by different behavior policies -- under the model of nonstationary episodic Markov Decision Processes (MDP) with a long horizon and a large action space. Existing importance sampling (IS) methods often suffer from large variance that depends exponentially on the RL horizon \(H\). To solve this problem, we consider a marginalized importance sampling (MIS) estimator that recursively estimates the state marginal distribution for the target policy at every step. MIS achieves a mean-squared error of \( \frac{1}{n} \sum\nolimits_{t=1}^H \mathbb{E}_{\mu}\left[\frac{d_t^\pi(s_t)^2}{d_t^\mu(s_t)^2} \mathrm{Var}_{\mu}\left[\frac{\pi_t(a_t|s_t)}{\mu_t(a_t|s_t)}\big( V_{t+1}^\pi(s_{t+1}) + r_t\big) \middle| s_t\right]\right] + \ti
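The recursive estimation of the state marginal described in the abstract can be illustrated with a minimal tabular sketch. This is one plausible instantiation under stated assumptions, not the authors' implementation: states and actions are assumed to be small non-negative integers, the array layout and the function name `mis_estimate` are hypothetical, and states never visited at a step simply contribute zero.

```python
import numpy as np

def mis_estimate(states, actions, rewards, next_states, pi, mu, n_states):
    """Tabular marginalized importance sampling sketch (hypothetical layout).

    states, actions, rewards, next_states: arrays of shape (n, H).
    pi, mu: arrays of shape (H, n_states, n_actions) holding the
    target and behavior action probabilities at each step.
    """
    n, H = states.shape
    # Empirical initial state distribution (shared by pi and mu).
    d = np.bincount(states[:, 0], minlength=n_states) / n
    value = 0.0
    for t in range(H):
        s, a = states[:, t], actions[:, t]
        # One-step importance weight pi_t(a|s) / mu_t(a|s) per episode.
        rho = pi[t, s, a] / mu[t, s, a]
        counts = np.bincount(s, minlength=n_states)
        safe = np.maximum(counts, 1)  # avoid 0/0 for unvisited states
        # Importance-weighted mean reward in each state at step t.
        r_hat = np.bincount(s, weights=rho * rewards[:, t],
                            minlength=n_states) / safe
        value += d @ r_hat
        # Importance-weighted transition estimate under pi at step t.
        P_hat = np.zeros((n_states, n_states))
        np.add.at(P_hat, (s, next_states[:, t]), rho)
        P_hat /= safe[:, None]
        # Recursion: push the estimated marginal one step forward.
        d = d @ P_hat
    return value
```

Because only one-step weights appear, no product of \(H\) importance ratios is ever formed, which is the intuition behind avoiding the exponential dependence on the horizon. As a sanity check, when `pi` equals `mu` the sketch reduces to the average observed return.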

Stats

  • arxiv key: xie2019towards