Abstract

In various control task domains, existing controllers provide a baseline level of performance that, though possibly suboptimal, should be maintained. Reinforcement learning (RL) algorithms that rely on extensive exploration of the state and action space can be used to optimize a control policy. However, fully exploratory RL algorithms may decrease performance below the baseline level during training. In this paper, we address the problem of optimizing a control policy online while minimizing regret with respect to the performance of a baseline policy. We present a joint imitation-reinforcement learning framework, denoted JIRL. The learning process in JIRL assumes the availability of a baseline policy and is designed with two objectives in mind: (a) leveraging the baseline's online demonstrations to minimize the regret with respect to the baseline policy during training, and (b) eventually surpassing the baseline performance. JIRL addresses these objectives by initially learning to imita

Authors

(none)

Tags

  • Exploration

Stats

  • citations: 5
  • S2 citations: n/a
  • GitHub stars: 0
  • HF likes: 0
  • heat score: 5.84
  • arXiv key: dey2022a

Related papers