Abstract

Recently, a new multi-step temporal-difference learning algorithm called \(Q(\sigma)\) unified \(n\)-step Tree-Backup (when \(\sigma=0\)) and \(n\)-step Sarsa (when \(\sigma=1\)) by introducing a sampling parameter \(\sigma\). However, like other multi-step temporal-difference learning algorithms, \(Q(\sigma)\) requires substantial memory and computation time. Eligibility traces are an important mechanism for transforming off-line updates into efficient on-line ones that consume less memory and computation time. In this paper, we further develop the original \(Q(\sigma)\), combine it with eligibility traces, and propose a new algorithm called \(Q(\sigma,\lambda)\), in which \(\lambda\) is the trace-decay parameter. This idea unifies Sarsa\((\lambda)\) (when \(\sigma=1\)) and \(Q^{\pi}(\lambda)\) (when \(\sigma=0\)). Furthermore, we give an upper error bound for the \(Q(\sigma,\lambda)\) policy evaluation algorithm. We prove that the \(Q(\sigma,\lambda)\) control algorithm can converge to the op
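
The role of \(\sigma\) described in the abstract can be illustrated with a minimal tabular sketch. It is not taken from the paper: the function names, the nested-list layout of `q`, `e`, and `pi`, and the plain accumulating trace with decay \(\gamma\lambda\) are all assumptions made for illustration, and terminal-state handling is omitted. The TD target mixes a sampled backup (weight \(\sigma\)) with an expected backup under the target policy (weight \(1-\sigma\)), so \(\sigma=1\) gives a Sarsa-style target and \(\sigma=0\) gives an expectation-based one.

```python
def sigma_td_error(q, pi, s, a, r, s_next, a_next, gamma, sigma):
    """One-step TD error with a sigma-weighted mix of sampled and expected backups.

    q[s][a]  : tabular action values (list of lists), hypothetical layout
    pi[s][a] : target-policy probabilities, same layout
    """
    sampled = q[s_next][a_next]                                   # Sarsa-style backup
    expected = sum(p * v for p, v in zip(pi[s_next], q[s_next]))  # expectation under pi
    target = r + gamma * (sigma * sampled + (1.0 - sigma) * expected)
    return target - q[s][a]


def q_sigma_lambda_step(q, e, pi, s, a, r, s_next, a_next,
                        alpha, gamma, lam, sigma):
    """One on-line update of q using eligibility traces e (modified in place).

    Assumption: a plain accumulating trace decayed by gamma * lam.
    """
    delta = sigma_td_error(q, pi, s, a, r, s_next, a_next, gamma, sigma)
    # Decay every trace, then bump the trace of the visited pair.
    for i in range(len(e)):
        for j in range(len(e[i])):
            e[i][j] *= gamma * lam
    e[s][a] += 1.0
    # Spread the TD error over all pairs in proportion to their traces.
    for i in range(len(q)):
        for j in range(len(q[i])):
            q[i][j] += alpha * delta * e[i][j]
    return delta


# Usage on a toy two-state, two-action table (values chosen arbitrarily).
q = [[0.0, 0.0], [1.0, 3.0]]
e = [[0.0, 0.0], [0.0, 0.0]]
pi = [[0.5, 0.5], [0.5, 0.5]]
q_sigma_lambda_step(q, e, pi, s=0, a=0, r=1.0, s_next=1, a_next=1,
                    alpha=0.1, gamma=0.5, lam=0.9, sigma=0.5)
```

Intermediate values of `sigma` interpolate linearly between the two targets, while the trace lets a single on-line step update every previously visited state-action pair.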

Authors

(none)

Tags

  • Uncategorized

Stats

  • citations: 7
  • S2 citations: n/a
  • github stars: 0
  • HF likes: 0
  • heat score: 6.77
  • arxiv key: yang2018a

Related papers