TBQ(\(\sigma\)): Improving Efficiency of Trace Utilization for Off-Policy Reinforcement Learning

Abstract

Off-policy reinforcement learning with eligibility traces is challenging because of the discrepancy between the target policy and the behavior policy. One common approach is to measure the difference between the two policies probabilistically, as in importance sampling and tree-backup. However, existing off-policy methods based on probabilistic policy measurement use traces inefficiently under a greedy target policy, which makes them ineffective for control problems: the traces are cut as soon as a non-greedy action is taken, which may forfeit the advantage of eligibility traces and slow down learning. Alternatively, some non-probabilistic measurement methods, such as General Q(\(\lambda\)) and Naive Q(\(\lambda\)), never cut traces but face convergence problems in practice. To address these issues, this paper introduces a new method named TBQ(\(\sigma\)), which effectively unifies the tree-backup algorithm and Naive Q(\(\lambda\)). By introducing a new param
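
The contrast the abstract draws between cutting traces and never cutting them can be illustrated with a minimal sketch. It is not the paper's TBQ(\(\sigma\)) update, whose parameter is not defined in this excerpt. It only assumes the standard trace recursion in which the weight on each later TD error is multiplied by \(\gamma \lambda \pi(a_t \mid s_t)\), with \(\pi\) equal to 1 for the greedy action and 0 otherwise under a greedy target policy (tree-backup style), and with \(\pi\) replaced by 1 in the Naive Q(\(\lambda\)) style. The function name `td_error_weights`, its arguments, and the toy trajectory are hypothetical.

```python
def td_error_weights(taken, greedy, gamma=1.0, lam=0.5, cut=True):
    """Weight given to the TD error at each step of one trajectory.

    taken  : actions actually chosen by the behavior policy
    greedy : greedy (target-policy) actions in the same states
    cut    : True  -> trace is zeroed on a non-greedy action
                      (greedy target policy, tree-backup style)
             False -> trace is never cut (Naive Q(lambda) style)
    All names here are illustrative assumptions, not the paper's notation.
    """
    weights = [1.0]  # the first TD error always has weight 1
    for a, g in zip(taken[1:], greedy[1:]):
        # Probability of the taken action under a greedy target policy,
        # or 1.0 when traces are never cut.
        pi = 1.0 if (a == g or not cut) else 0.0
        weights.append(weights[-1] * gamma * lam * pi)
    return weights


# Toy trajectory: the third action (index 2) is non-greedy.
taken = [0, 1, 0, 0]
greedy = [0, 1, 1, 0]

cut_weights = td_error_weights(taken, greedy, cut=True)
naive_weights = td_error_weights(taken, greedy, cut=False)
print(cut_weights)    # [1.0, 0.5, 0.0, 0.0]
print(naive_weights)  # [1.0, 0.5, 0.25, 0.125]
```

With cutting, every TD error after the non-greedy action receives zero weight, so the later part of the trajectory contributes nothing to the update. Without cutting, all steps contribute, which is the behavior the abstract associates with convergence problems in practice.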
