Abstract

Off-policy reinforcement learning with eligibility traces is challenging because of the discrepancy between the target policy and the behavior policy. One common approach is to measure the difference between the two policies probabilistically, as in importance sampling and tree-backup. However, existing off-policy methods based on probabilistic policy measurement use traces inefficiently under a greedy target policy, which makes them ineffective for control problems: the traces are cut as soon as a non-greedy action is taken, which may forfeit the advantage of eligibility traces and slow down learning. Alternatively, non-probabilistic measurement methods such as General Q(\(\lambda\)) and Naive Q(\(\lambda\)) never cut traces, but face convergence problems in practice. To address these issues, this paper introduces a new method named TBQ(\(\sigma\)), which effectively unifies the tree-backup algorithm and Naive Q(\(\lambda\)). By introducing a new param
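The trace-cutting behavior described in the abstract can be illustrated with a minimal sketch. It assumes the commonly used trace-decay forms, in which tree-backup scales the trace by gamma * lambda * pi(a|s) at each step and Naive Q(lambda) scales it by gamma * lambda regardless of the action. The function name `trace_weights` and its parameters are hypothetical, and the sketch does not implement TBQ(\(\sigma\)) itself, whose definition is not given in the text above.

```python
def trace_weights(actions, greedy_actions, gamma=0.99, lam=0.9):
    """Cumulative trace weight of an earlier state-action pair after each later step.

    Assumption (not from the source text): tree-backup decays the trace by
    gamma * lam * pi(a|s), and Naive Q(lambda) decays it by gamma * lam.
    The target policy is greedy, so pi(a|s) is 1 for the greedy action, else 0.
    """
    tree_backup, naive = [], []
    tb_w, naive_w = 1.0, 1.0
    for action, greedy in zip(actions, greedy_actions):
        pi = 1.0 if action == greedy else 0.0  # greedy target policy probability
        tb_w *= gamma * lam * pi               # drops to 0 after a non-greedy action
        naive_w *= gamma * lam                 # never cut, only decayed
        tree_backup.append(tb_w)
        naive.append(naive_w)
    return tree_backup, naive


if __name__ == "__main__":
    # The behavior policy takes a non-greedy action (1) at the second step.
    tb, nv = trace_weights([0, 1, 0], [0, 0, 0], gamma=0.5, lam=0.5)
    print("tree-backup:", tb)
    print("naive:      ", nv)
```

With a single exploratory action in the sequence, the tree-backup weight stays at zero for all later steps, while the naive weight keeps decaying geometrically. This is the trade-off between cutting traces and never cutting them that the abstract describes.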


Tags

  • Policy Gradient

Stats

  • arxiv key: shi2019tbq
