Abstract

Many practical applications of the Deep Q-network (DQN) algorithm indicate that a stochastic policy, despite its simplicity, is the most frequently used exploration approach. However, most existing stochastic exploration approaches either explore new actions heuristically, regardless of Q-values, or inevitably introduce bias into the learning process in order to couple sampling with Q-values. In this paper, we propose a novel preference-guided \(\epsilon\)-greedy exploration algorithm that efficiently learns an action distribution in line with the landscape of Q-values for DQN without introducing additional bias. Specifically, we design a dual architecture consisting of two branches, one of which is a copy of DQN, namely the Q-branch. The other branch, which we call the preference branch, learns the action preference that the DQN implicitly follows. We theoretically prove that the policy improvement theorem holds for the preference-guided \(\epsilon\)-greedy policy and experimentally
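
As a rough illustration of the action-selection step, the following minimal sketch assumes that the exploratory move is drawn from the preference branch's output rather than uniformly at random; the abstract does not spell this out, so treat it as one plausible reading. The function name, its arguments, and the plain-list inputs are all hypothetical, and the training of either branch is not shown.

```python
import random


def preference_guided_epsilon_greedy(q_values, preference, epsilon, rng=random):
    """Return an action index (hypothetical helper, not the authors' code).

    q_values:   per-action Q-value estimates, standing in for the Q-branch output.
    preference: non-negative per-action weights, standing in for the
                preference-branch output (assumed to have a positive sum).
    epsilon:    exploration probability in [0, 1].
    """
    if len(q_values) != len(preference):
        raise ValueError("q_values and preference must have the same length")

    actions = range(len(q_values))

    # Exploit: with probability 1 - epsilon, take the greedy action.
    if rng.random() >= epsilon:
        return max(actions, key=lambda a: q_values[a])

    # Explore: assumed here to sample from the preference weights
    # instead of the uniform distribution used by plain epsilon-greedy.
    return rng.choices(actions, weights=preference, k=1)[0]


if __name__ == "__main__":
    q = [0.1, 0.9, 0.3]
    pref = [0.2, 0.5, 0.3]
    print(preference_guided_epsilon_greedy(q, pref, epsilon=0.1))
```

Setting `epsilon` to 0 reduces this to a purely greedy policy, while passing uniform weights as `preference` recovers ordinary \(\epsilon\)-greedy exploration.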

Authors

(none)

Tags

  • Exploration

Stats

  • citations: 11
  • S2 citations: -
  • github stars: 0
  • HF likes: 0
  • heat score: 8.09
  • arxiv key: huang2022sampling

Related papers