Abstract

Reinforcement learning algorithms typically rely on the assumption that the environment dynamics and value function can be expressed in terms of a Markovian state representation. However, when state information is only partially observable, how can an agent learn such a state representation, and how can it detect when it has found one? We introduce a metric that can accomplish both objectives, without requiring access to -- or knowledge of -- an underlying, unobservable state space. Our metric, the \(\lambda\)-discrepancy, is the difference between two distinct temporal difference (TD) value estimates, each computed using TD(\(\lambda\)) with a different value of \(\lambda\). Since TD(\(\lambda{=}0\)) makes an implicit Markov assumption and TD(\(\lambda{=}1\)) does not, a discrepancy between these estimates is a potential indicator of a non-Markovian state representation. Indeed, we prove that the \(\lambda\)-discrepancy is exactly zero for all Markov decision processes and almost
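
To make the idea of comparing TD(\(\lambda{=}0\)) and TD(\(\lambda{=}1\)) estimates concrete, the following is a minimal sketch under several assumptions that are not taken from the paper: the policy is folded into a fixed transition matrix `P`, state values are used rather than action values, hidden states are weighted by their discounted occupancy from a start distribution `p0`, and both estimates are computed in closed form instead of from samples. The function name `lambda_discrepancy` and the toy four-state chain are hypothetical and chosen only for illustration.

```python
import numpy as np

def lambda_discrepancy(P, r, phi, p0, gamma):
    """Closed-form TD(0) and TD(1) values over observations, plus their gap.

    P:   (S, S) state transition matrix under a fixed policy (assumed given)
    r:   (S,)   expected reward received in each state
    phi: (S, O) observation function, phi[s, o] = Pr(o | s)
    p0:  (S,)   start-state distribution
    """
    n_states, n_obs = phi.shape
    # Discounted state occupancy from p0 (an assumed choice of weighting).
    occupancy = np.linalg.solve(np.eye(n_states) - gamma * P.T, p0)
    # W[o, s] = Pr(s | o): how much each hidden state contributes to observation o.
    W = (phi * occupancy[:, None]).T
    W = W / W.sum(axis=1, keepdims=True)
    # TD(lambda=1), i.e. Monte Carlo: average the true state values behind each observation.
    v_state = np.linalg.solve(np.eye(n_states) - gamma * P, r)
    v_mc = W @ v_state
    # TD(lambda=0): bootstrap as if the observations themselves were Markov states.
    P_obs = W @ P @ phi
    r_obs = W @ r
    v_td0 = np.linalg.solve(np.eye(n_obs) - gamma * P_obs, r_obs)
    return v_td0, v_mc, np.linalg.norm(v_td0 - v_mc)

# Toy chain 0 -> 1 -> 2 -> 3 (absorbing), with reward 1 collected in state 2.
gamma = 0.9
P = np.array([[0, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [0, 0, 0, 1]], dtype=float)
r = np.array([0, 0, 1, 0], dtype=float)
p0 = np.array([1, 0, 0, 0], dtype=float)

# Aliased observations: states 0 and 1 look identical to the agent.
phi_aliased = np.array([[1, 0, 0],
                        [1, 0, 0],
                        [0, 1, 0],
                        [0, 0, 1]], dtype=float)
# Fully observable case: every state has its own observation.
phi_markov = np.eye(4)

_, _, d_aliased = lambda_discrepancy(P, r, phi_aliased, p0, gamma)
_, _, d_markov = lambda_discrepancy(P, r, phi_markov, p0, gamma)
print("aliased:", d_aliased, "markov:", d_markov)
```

In this toy setting the two estimates coincide when every state is observed directly and differ once states 0 and 1 are aliased, which is the kind of signal the abstract describes.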
