Abstract
This paper proposes two exploration strategies based on the upper confidence bound (UCB) principle to address the exploration–exploitation trade-off in deep reinforcement learning. Both aim to reduce reliance on inefficient random exploration, thereby improving sample efficiency and performance in environments with sparse rewards. For large discrete action spaces, the ε-perceptual hashing upper confidence bound method aggregates similar states using perceptual hashing, which reduces memory consumption, and guides exploration by combining UCB with an ε-greedy strategy. For continuous action spaces, the multi-actor upper confidence bound method combines the deep deterministic policy gradient framework with UCB-based actor selection to improve exploration efficiency. The methods are evaluated on the Atari 2600 benchmark for discrete control and on the MuJoCo benchmark for continuous control.
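To make the discrete-action idea concrete, the following Python sketch shows one way hashed state counts, a UCB bonus, and ε-greedy selection could fit together. It is an illustration under stated assumptions, not the paper's implementation: the `average_hash` function is a simple stand-in for perceptual hashing, the bonus is a smoothed UCB1-style term, and the names `HashedUCB`, `c`, `epsilon`, and `grid` are hypothetical.

```python
import math
import random
from collections import defaultdict


def average_hash(frame, grid=4):
    """Coarse average hash of a 2D grayscale frame (list of rows).

    Assumption: a basic average hash stands in for perceptual hashing.
    Frames that look alike map to the same key, so they share counts.
    """
    h, w = len(frame), len(frame[0])
    bh, bw = h // grid, w // grid
    cells = []
    for gy in range(grid):
        for gx in range(grid):
            block = [frame[y][x]
                     for y in range(gy * bh, (gy + 1) * bh)
                     for x in range(gx * bw, (gx + 1) * bw)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    # One bit per cell: brighter than the mean cell brightness or not.
    return tuple(int(cell > mean) for cell in cells)


class HashedUCB:
    """Hypothetical action selector: epsilon-greedy over UCB scores."""

    def __init__(self, n_actions, c=1.0, epsilon=0.1, seed=0):
        self.n_actions = n_actions
        self.c = c                # weight of the exploration bonus (assumed)
        self.epsilon = epsilon    # probability of a uniformly random action
        self.rng = random.Random(seed)
        # Visit counts per (hashed state, action).
        self.counts = defaultdict(lambda: [0] * n_actions)

    def select(self, frame, q_values):
        key = average_hash(frame)
        if self.rng.random() < self.epsilon:
            action = self.rng.randrange(self.n_actions)
        else:
            n = self.counts[key]
            t = sum(n) + 1
            # Smoothed UCB1-style bonus; the +1 terms avoid division by zero.
            scores = [q + self.c * math.sqrt(math.log(t + 1) / (n_a + 1))
                      for q, n_a in zip(q_values, n)]
            action = max(range(self.n_actions), key=scores.__getitem__)
        self.counts[key][action] += 1
        return action
```

In this sketch, `q_values` would come from the agent's value network. Because counts are keyed by the hash rather than by the raw frame, memory grows with the number of distinct hashes instead of the number of distinct observations.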