Abstract

With reinforcement learning, an agent could learn complex behaviors from high-level abstractions of the task. However, exploration and reward shaping remained challenging for existing methods, especially in scenarios where the extrinsic feedback was sparse. Expert demonstrations have been investigated to solve these difficulties, but a tremendous number of high-quality demonstrations were usually required. In this work, an integrated policy gradient algorithm was proposed to boost exploration and facilitate intrinsic reward learning from only limited number of demonstrations. We achieved this by reformulating the original reward function with two additional terms, where the first term measured the Jensen-Shannon divergence between current policy and the expert, and the second term estimated the agent's uncertainty about the environment. The presented algorithm was evaluated on a range of simulated tasks with sparse extrinsic reward signals where only one single demonstrated trajectory

Authors

(none)

Tags

  • Policy Gradient
  • Exploration

Stats

  • citations0
  • S2 citationsβ€”
  • github stars0
  • HF likes0
  • heat score0.00
  • arxiv keychen2020policy

Related papers